Capstone Project on Unsupervised Learning¶

Context¶

The marketing team is planning next year's marketing strategies and wants to rethink its approach, given the poor results its campaigns have had in recent years.

The new plan is based on implementing segmentation strategies, something that they have not done before.

The objectives:¶

Segmentation is one of the main strategies used in marketing and has been proven effective, so it is the objective of this project: to provide information about the segments that exist in the customer base, their qualities, and what distinguishes them.

The key questions:¶

At this initial point of the project, many questions come to mind. Some need to be answered by the business, while others can be answered by this data science project. Let's list them all first to get a clear picture of the exploration:

  • What is the main customer segment we have so far?
  • What are the qualities of this segment?
  • What is the second segment? How is this different from the main one?
  • Why are they the two main segments? What measures are we using for this judgment? (E.g., amount spent during a period, whether they are repeat clients, etc.).
  • What are the best channels to reach these two main segments?
  • With what products?
  • What is the price point of the products that these segments purchase?

The problem formulation:¶

As a data scientist, the main challenge here is to find clear patterns and groups in the data, identifying their qualities and what distinguishes them from each other, so they can later be interpreted as customer segments. Ideally, we will find two groups that are as different from each other as possible. This knowledge can give the business and marketing teams clarity about who their clients are and how to improve their strategies to better achieve their business goals. Data science offers robust techniques for providing the business and marketing teams with trustworthy, data-based information and knowledge, so they can make informed decisions about their challenges. With this project, we can contribute valuable, curated information about the business and actionable knowledge (in this case: customer profiles, their behavior, and the existing segments). The nature of this problem is unsupervised learning, because we do not know beforehand what segments exist in this customer base. We will benefit from applying clustering techniques and letting the data reveal them.
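
The clustering workflow described above can be sketched in miniature as follows. This is a minimal illustration, not the project's final pipeline; the two well-separated synthetic groups are hypothetical stand-ins for the customer features we will actually build later.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated synthetic groups stand in for (hypothetical) customer features
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(6, 1, (50, 3))])

X_scaled = StandardScaler().fit_transform(X)                        # z-score scaling
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_scaled)
score = silhouette_score(X_scaled, km.labels_)                      # closer to 1 = better-separated clusters
print(round(score, 3))
```

The same scale-cluster-evaluate loop, repeated for different numbers of clusters, is what the later milestones of this notebook carry out on the real data.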


Data Dictionary¶


The dataset contains the following features:

  1. ID: Unique ID of each customer
  2. Year_Birth: Customer’s year of birth
  3. Education: Customer's level of education
  4. Marital_Status: Customer's marital status
  5. Kidhome: Number of small children in customer's household
  6. Teenhome: Number of teenagers in customer's household
  7. Income: Customer's yearly household income in USD
  8. Recency: Number of days since the last purchase
  9. Dt_Customer: Date of customer's enrollment with the company
  10. MntFishProducts: The amount spent on fish products in the last 2 years
  11. MntMeatProducts: The amount spent on meat products in the last 2 years
  12. MntFruits: The amount spent on fruit products in the last 2 years
  13. MntSweetProducts: Amount spent on sweet products in the last 2 years
  14. MntWines: The amount spent on wine products in the last 2 years
  15. MntGoldProds: The amount spent on gold products in the last 2 years
  16. NumDealsPurchases: Number of purchases made with discount
  17. NumCatalogPurchases: Number of purchases made using a catalog (buying goods to be shipped through the mail)
  18. NumStorePurchases: Number of purchases made directly in stores
  19. NumWebPurchases: Number of purchases made through the company's website
  20. NumWebVisitsMonth: Number of visits to the company's website in the last month
  21. AcceptedCmp1: 1 if customer accepted the offer in the first campaign, 0 otherwise
  22. AcceptedCmp2: 1 if customer accepted the offer in the second campaign, 0 otherwise
  23. AcceptedCmp3: 1 if customer accepted the offer in the third campaign, 0 otherwise
  24. AcceptedCmp4: 1 if customer accepted the offer in the fourth campaign, 0 otherwise
  25. AcceptedCmp5: 1 if customer accepted the offer in the fifth campaign, 0 otherwise
  26. Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
  27. Complain: 1 If the customer complained in the last 2 years, 0 otherwise

Note: You can assume that the data is collected in the year 2016.

Milestone 1¶

Loading Libraries¶

In [1]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
!pip install scikit-learn
!pip install scikit-learn-extra

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# To scale the data using z-score
from sklearn.preprocessing import StandardScaler

# To compute distances
from scipy.spatial.distance import cdist

# To perform K-means clustering and compute Silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# To visualize the elbow curve and Silhouette scores
from yellowbrick.cluster import SilhouetteVisualizer

# Importing PCA
from sklearn.decomposition import PCA

# To encode the variable
from sklearn.preprocessing import LabelEncoder

# Importing TSNE
from sklearn.manifold import TSNE

# To perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet

# To compute distances
from scipy.spatial.distance import pdist

# To import K-Medoids

from sklearn_extra.cluster import KMedoids

# To import Gaussian Mixture
from sklearn.mixture import GaussianMixture

# To import DBSCAN
from sklearn.cluster import DBSCAN

# To suppress warnings
import warnings

warnings.filterwarnings("ignore")

Loading the data¶

In [420]:
data = pd.read_csv("marketing_campaign.csv")

Checking the shape of the data¶

In [421]:
data.shape
Out[421]:
(2240, 27)

Observations and Insights:¶

This is a customer database with 2,240 entries and 27 columns.

In [422]:
# Checking if there are duplicated entries.
data[data.duplicated()]
Out[422]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines ... NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response

0 rows × 27 columns

Understanding the data by observing a few rows¶

In [423]:
data.head(10)
Out[423]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines ... NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response
0 5524 1957 Graduation Single 58138.0 0 0 04-09-2012 58 635 ... 10 4 7 0 0 0 0 0 0 1
1 2174 1954 Graduation Single 46344.0 1 1 08-03-2014 38 11 ... 1 2 5 0 0 0 0 0 0 0
2 4141 1965 Graduation Together 71613.0 0 0 21-08-2013 26 426 ... 2 10 4 0 0 0 0 0 0 0
3 6182 1984 Graduation Together 26646.0 1 0 10-02-2014 26 11 ... 0 4 6 0 0 0 0 0 0 0
4 5324 1981 PhD Married 58293.0 1 0 19-01-2014 94 173 ... 3 6 5 0 0 0 0 0 0 0
5 7446 1967 Master Together 62513.0 0 1 09-09-2013 16 520 ... 4 10 6 0 0 0 0 0 0 0
6 965 1971 Graduation Divorced 55635.0 0 1 13-11-2012 34 235 ... 3 7 6 0 0 0 0 0 0 0
7 6177 1985 PhD Married 33454.0 1 0 08-05-2013 32 76 ... 0 4 8 0 0 0 0 0 0 0
8 4855 1974 PhD Together 30351.0 1 0 06-06-2013 19 14 ... 0 2 9 0 0 0 0 0 0 1
9 5899 1950 PhD Together 5648.0 1 1 13-03-2014 68 28 ... 0 0 20 1 0 0 0 0 0 0

10 rows × 27 columns

In [424]:
data.tail(10)
Out[424]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines ... NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response
2230 7004 1984 Graduation Single 11012.0 1 0 16-03-2013 82 24 ... 1 2 9 1 0 0 0 0 0 0
2231 9817 1970 Master Single 44802.0 0 0 21-08-2012 71 853 ... 4 12 8 0 0 0 0 0 0 0
2232 8080 1986 Graduation Single 26816.0 0 0 17-08-2012 50 5 ... 0 3 4 0 0 0 0 0 0 0
2233 9432 1977 Graduation Together 666666.0 1 0 02-06-2013 23 9 ... 1 3 6 0 0 0 0 0 0 0
2234 8372 1974 Graduation Married 34421.0 1 0 01-07-2013 81 3 ... 0 2 7 0 0 0 0 0 0 0
2235 10870 1967 Graduation Married 61223.0 0 1 13-06-2013 46 709 ... 3 4 5 0 0 0 0 0 0 0
2236 4001 1946 PhD Together 64014.0 2 1 10-06-2014 56 406 ... 2 5 7 0 0 0 1 0 0 0
2237 7270 1981 Graduation Divorced 56981.0 0 0 25-01-2014 91 908 ... 3 13 6 0 1 0 0 0 0 0
2238 8235 1956 Master Together 69245.0 0 1 24-01-2014 8 428 ... 5 10 3 0 0 0 0 0 0 0
2239 9405 1954 PhD Married 52869.0 1 1 15-10-2012 40 84 ... 1 4 7 0 0 0 0 0 0 1

10 rows × 27 columns

Observations and Insights:¶

Here we get a first view of the names of the columns and the type of data collected.

Let us check the data types and missing values of each column¶

In [425]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   int64  
 16  NumWebPurchases      2240 non-null   int64  
 17  NumCatalogPurchases  2240 non-null   int64  
 18  NumStorePurchases    2240 non-null   int64  
 19  NumWebVisitsMonth    2240 non-null   int64  
 20  AcceptedCmp3         2240 non-null   int64  
 21  AcceptedCmp4         2240 non-null   int64  
 22  AcceptedCmp5         2240 non-null   int64  
 23  AcceptedCmp1         2240 non-null   int64  
 24  AcceptedCmp2         2240 non-null   int64  
 25  Complain             2240 non-null   int64  
 26  Response             2240 non-null   int64  
dtypes: float64(1), int64(23), object(3)
memory usage: 472.6+ KB
In [426]:
# Finding the percentage of missing values in each column of the data
(data.isnull().sum()/data.shape[0]*100)
Out[426]:
ID                     0.000000
Year_Birth             0.000000
Education              0.000000
Marital_Status         0.000000
Income                 1.071429
Kidhome                0.000000
Teenhome               0.000000
Dt_Customer            0.000000
Recency                0.000000
MntWines               0.000000
MntFruits              0.000000
MntMeatProducts        0.000000
MntFishProducts        0.000000
MntSweetProducts       0.000000
MntGoldProds           0.000000
NumDealsPurchases      0.000000
NumWebPurchases        0.000000
NumCatalogPurchases    0.000000
NumStorePurchases      0.000000
NumWebVisitsMonth      0.000000
AcceptedCmp3           0.000000
AcceptedCmp4           0.000000
AcceptedCmp5           0.000000
AcceptedCmp1           0.000000
AcceptedCmp2           0.000000
Complain               0.000000
Response               0.000000
dtype: float64

Observations and Insights:¶

The only column with missing values is Income; I will treat it later in this project.

We can also observe that ID has no null values and that its number of unique values equals the number of observations. So, ID looks like an index for the data entries; such a column would not provide any predictive power for our analysis, so it can be dropped.
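
One possible way to treat the missing Income values noted above is group-wise median imputation, e.g. filling each gap with the median income of that customer's Education group. This is only a sketch of the idea on a tiny hypothetical frame, not necessarily the treatment used later in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame; `Education` and `Income` mirror the real columns
df = pd.DataFrame({
    "Education": ["PhD", "PhD", "Master", "Basic", "Basic"],
    "Income":    [80000.0, np.nan, 50000.0, np.nan, 30000.0],
})

# Fill each missing Income with the median Income of that customer's Education group
df["Income"] = df.groupby("Education")["Income"].transform(lambda s: s.fillna(s.median()))
print(df["Income"].tolist())
```

Group-wise imputation preserves differences between education levels that a single global median would blur.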

Dropping the ID column:

In [427]:
data1 = data.copy()
data.drop("ID", axis = 1, inplace= True)

Exploratory Data Analysis¶

Summary statistics of numerical variables¶

In [428]:
data.describe().T
Out[428]:
count mean std min 25% 50% 75% max
Year_Birth 2240.0 1968.805804 11.984069 1893.0 1959.00 1970.0 1977.00 1996.0
Income 2216.0 52247.251354 25173.076661 1730.0 35303.00 51381.5 68522.00 666666.0
Kidhome 2240.0 0.444196 0.538398 0.0 0.00 0.0 1.00 2.0
Teenhome 2240.0 0.506250 0.544538 0.0 0.00 0.0 1.00 2.0
Recency 2240.0 49.109375 28.962453 0.0 24.00 49.0 74.00 99.0
MntWines 2240.0 303.935714 336.597393 0.0 23.75 173.5 504.25 1493.0
MntFruits 2240.0 26.302232 39.773434 0.0 1.00 8.0 33.00 199.0
MntMeatProducts 2240.0 166.950000 225.715373 0.0 16.00 67.0 232.00 1725.0
MntFishProducts 2240.0 37.525446 54.628979 0.0 3.00 12.0 50.00 259.0
MntSweetProducts 2240.0 27.062946 41.280498 0.0 1.00 8.0 33.00 263.0
MntGoldProds 2240.0 44.021875 52.167439 0.0 9.00 24.0 56.00 362.0
NumDealsPurchases 2240.0 2.325000 1.932238 0.0 1.00 2.0 3.00 15.0
NumWebPurchases 2240.0 4.084821 2.778714 0.0 2.00 4.0 6.00 27.0
NumCatalogPurchases 2240.0 2.662054 2.923101 0.0 0.00 2.0 4.00 28.0
NumStorePurchases 2240.0 5.790179 3.250958 0.0 3.00 5.0 8.00 13.0
NumWebVisitsMonth 2240.0 5.316518 2.426645 0.0 3.00 6.0 7.00 20.0
AcceptedCmp3 2240.0 0.072768 0.259813 0.0 0.00 0.0 0.00 1.0
AcceptedCmp4 2240.0 0.074554 0.262728 0.0 0.00 0.0 0.00 1.0
AcceptedCmp5 2240.0 0.072768 0.259813 0.0 0.00 0.0 0.00 1.0
AcceptedCmp1 2240.0 0.064286 0.245316 0.0 0.00 0.0 0.00 1.0
AcceptedCmp2 2240.0 0.012946 0.113069 0.0 0.00 0.0 0.00 1.0
Complain 2240.0 0.009375 0.096391 0.0 0.00 0.0 0.00 1.0
Response 2240.0 0.149107 0.356274 0.0 0.00 0.0 0.00 1.0

Observations and Insights:¶

This is the first exploration of the statistics of the numerical data.

  • Regarding the customer profile: the mean year of birth is 1968, although it will shift slightly once we drop the outliers (some of which are obvious mistakes). The medians of both Kidhome and Teenhome are 0 (means of 0.44 and 0.51, respectively), so having no children at home is the most common situation among these clients. This is something interesting to consider when profiling the customer segments.
  • Regarding customer purchase behavior: two products stand out by mean amount spent in the last 2 years: wines rank first with a mean of 303.9, followed by meat products with 166.95; gold products come third with a mean of 44. Some of these columns have outliers, and we will see later how to treat them.
  • Regarding the number of purchases per channel: in-store purchases have the highest mean (5.79), followed by purchases made through the web (4.08). Considering that visits to the company’s website in the last month average 5.3, we can analyze whether there is a positive correlation between web visits and web purchases.
  • Regarding customer responsiveness to the marketing campaigns: these features are Boolean, where 1 denotes that the customer accepted the campaign and 0 that they didn’t. The data confirm the challenge the marketing team has right now: the campaign acceptance rate is very low, with the best-performing campaign averaging only 0.07.
  • Finally, a good thing to notice is that the rate of complaints is almost non-existent, with a mean of 0.009.
  • Outliers: some outliers will need to be treated. One is a clear mistake, as seen in the Year_Birth column. There are also outliers in the Income and MntWines columns that inflate the means.
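
The correlation hypothesized in the third bullet (web visits vs. web purchases) can be checked in one line. The frame below is a toy stand-in; on the real data the call would be `data["NumWebVisitsMonth"].corr(data["NumWebPurchases"])`.

```python
import pandas as pd

# Toy stand-in with the same column names as the real data
toy = pd.DataFrame({
    "NumWebVisitsMonth": [1, 3, 5, 7, 9],
    "NumWebPurchases":   [0, 2, 4, 5, 8],
})
r = toy["NumWebVisitsMonth"].corr(toy["NumWebPurchases"])  # Pearson correlation by default
print(round(r, 2))
```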

Summary statistics of all categorical variables and the number of unique observations in each category¶

In [429]:
cols = ["Education", "Marital_Status", "Kidhome", "Teenhome"]

Number of unique observations in each category:

In [430]:
for column in cols:
    print("Unique values in", column, "are :")
    print(data[column].value_counts(normalize = True))
    print("*" * 50)
Unique values in Education are :
Graduation    0.503125
PhD           0.216964
Master        0.165179
2n Cycle      0.090625
Basic         0.024107
Name: Education, dtype: float64
**************************************************
Unique values in Marital_Status are :
Married     0.385714
Together    0.258929
Single      0.214286
Divorced    0.103571
Widow       0.034375
Alone       0.001339
Absurd      0.000893
YOLO        0.000893
Name: Marital_Status, dtype: float64
**************************************************
Unique values in Kidhome are :
0    0.577232
1    0.401339
2    0.021429
Name: Kidhome, dtype: float64
**************************************************
Unique values in Teenhome are :
0    0.516964
1    0.459821
2    0.023214
Name: Teenhome, dtype: float64
**************************************************

Observations and Insights:¶

A first exploration:

  • Regarding Education: we can notice that this column has five different values. More than 50% of the customers have Graduation as their education level, about 26% have a Master’s degree or a “2n Cycle” qualification, and about 22% have a PhD.
  • Regarding Marital Status: this column has eight different values. We can merge some of them for a better analysis of these data.
  • Regarding Children: Kidhome and Teenhome carry data of the same nature and can be merged later. As they stand, more than 50% of customers have no children or teenagers living at home, and about 43% have exactly one child or teenager in the household.

More data treatment:¶

Let's replace the "2n Cycle" category with "Master" in Education, and replace "Alone" with "Single" and "Absurd" and "YOLO" with "Others" in Marital_Status¶

In [431]:
# Replace the category "2n Cycle" with the category "Master"

data["Education"].replace("2n Cycle", "Master", inplace = True)  # Hint: Use the replace() method and inplace=True
In [432]:
# Replace the categories "Alone", "Absurd", "YOLO" with the category "Single"

data["Marital_Status"].replace("Alone", "Single", inplace = True)  # Hint: Use the replace() method and inplace=True
In [433]:
data["Marital_Status"].replace(["Absurd", "YOLO"], "Others", inplace = True)

Univariate Analysis¶

Univariate analysis is used to explore each variable in a data set, separately. It looks at the range of values, as well as the central tendency of the values. It can be done for both numerical and categorical variables.

1. Univariate Analysis - Numerical Data¶

Histograms help to visualize and describe numerical data. We can also use other plots, like box plots, to analyze the numerical columns.

Let us plot histogram for the feature 'Income' to understand the distribution and outliers, if any.¶

In [434]:
plt.figure(figsize=(10, 7))
sns.histplot(x="Income", data=data)
plt.show()

We can observe some extreme values on the right side of the distribution of the 'Income' feature. Let's use a box plot, as it is more suitable for identifying extreme values in the data.

In [435]:
sns.boxplot(data=data, x="Income", showmeans=True, color="violet")
Out[435]:
<AxesSubplot:xlabel='Income'>

Observations and Insights:¶

There are some outliers in Income. Let's analyze them a bit further to decide whether to keep or drop them.

In [436]:
# Calculating the upper whisker for the Income variable

Q1 = data.quantile(q=0.25)                          # Finding the first quartile

Q3 = data.quantile(q=0.75)                          # Finding the third quartile

IQR = Q3 - Q1                                       # Finding the Inter Quartile Range

upper_whisker = (Q3 + 1.5 * IQR)["Income"]          # Calculating the Upper Whisker for the Income variable

print(upper_whisker)                                # Printing Upper Whisker
118350.5
In [437]:
# Let's check the observations with extreme value for the Income variable
data[data.Income > upper_whisker]
Out[437]:
Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits ... NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response
164 1973 PhD Married 157243.0 0 1 01-03-2014 98 20 2 ... 22 0 0 0 0 0 0 0 0 0
617 1976 PhD Together 162397.0 1 1 03-06-2013 31 85 1 ... 0 1 1 0 0 0 0 0 0 0
655 1975 Graduation Divorced 153924.0 0 0 07-02-2014 81 1 1 ... 0 0 0 0 0 0 0 0 0 0
687 1982 PhD Married 160803.0 0 0 04-08-2012 21 55 16 ... 28 1 0 0 0 0 0 0 0 0
1300 1971 Master Together 157733.0 1 0 04-06-2013 37 39 1 ... 0 1 1 0 0 0 0 0 0 0
1653 1977 Graduation Together 157146.0 0 0 29-04-2013 13 1 0 ... 28 0 1 0 0 0 0 0 0 0
2132 1949 PhD Married 156924.0 0 0 29-08-2013 85 2 1 ... 0 0 0 0 0 0 0 0 0 0
2233 1977 Graduation Together 666666.0 1 0 02-06-2013 23 9 14 ... 1 3 6 0 0 0 0 0 0 0

8 rows × 26 columns

In [438]:
# Checking the 99.5% percentile value for the Income variable
data.quantile(q=0.995)["Income"]
Out[438]:
102145.75000000003

Observations and Insights:¶

These 8 outliers represent only about 0.4% of the customer base. Since they are too few to be statistically meaningful and would distort the analysis, I will drop them.

In [439]:
# Dropping observations identified as outliers 
data.drop(index=[164, 617, 655, 687, 1300, 1653, 2132, 2233], inplace=True) # Pass the indices of the observations (separated by a comma) to drop them

Now, let's check the distribution of the Income variable after dropping outliers.

In [440]:
plt.figure(figsize=(10, 7))
sns.histplot(x="Income", data=data)
plt.show()
In [441]:
plt.figure(figsize=(10, 7))
sns.histplot(x="MntWines", data=data)
plt.show()
In [442]:
plt.figure(figsize=(10, 7))
sns.histplot(x="MntMeatProducts", data=data)
plt.show()
In [443]:
plt.figure(figsize=(10, 7))
sns.histplot(x="MntGoldProds", data=data)
plt.show()
In [444]:
plt.figure(figsize=(10, 7))
sns.histplot(x="MntFishProducts", data=data)
plt.show()
In [445]:
plt.figure(figsize=(10, 7))
sns.histplot(x="MntSweetProducts", data=data)
plt.show()
In [446]:
plt.figure(figsize=(10, 7))
sns.histplot(x="MntFruits", data=data)
plt.show()
In [447]:
plt.figure(figsize=(10, 7))
sns.histplot(x="Recency", data=data)
plt.show()
In [448]:
plt.figure(figsize=(10, 7))
sns.histplot(x="Year_Birth", data=data)
plt.show()
In [449]:
plt.figure(figsize=(10, 7))
sns.histplot(x="NumStorePurchases", data=data)
plt.show()
In [450]:
plt.figure(figsize=(10, 7))
sns.histplot(x="NumWebPurchases", data=data)
plt.show()
In [451]:
plt.figure(figsize=(10, 7))
sns.histplot(x="NumWebVisitsMonth", data=data)
plt.show()
In [452]:
plt.figure(figsize=(10, 7))
sns.histplot(x="NumCatalogPurchases", data=data)
plt.show()
In [453]:
plt.figure(figsize=(10, 7))
sns.histplot(x="NumDealsPurchases", data=data)
plt.show()

2. Univariate analysis - Categorical Data¶

Let us write a function that will help us create bar plots that indicate the percentage for each category. This function takes the categorical column as the input and returns the bar plot for the variable.

In [454]:
def perc_on_bar(z):
    '''
    Plot a bar chart of the categorical feature `z` with the percentage
    of each category annotated above its bar.
    (The function won't work if a column is passed in the hue parameter.)
    '''

    total = len(data[z])                                          # Number of observations
    plt.figure(figsize=(15, 5))
    ax = sns.countplot(x=z, data=data, palette='Paired', order=data[z].value_counts().index)
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total) # Percentage of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.05                  # x position of the annotation (bar center)
        y = p.get_y() + p.get_height()                            # y position of the annotation (bar top)
        
        ax.annotate(percentage, (x, y), size = 12)                # Annotate the percentage 
    
    plt.show()                                                    # Show the plot

Let us plot barplot for the categorical variables:¶

In [455]:
perc_on_bar('Marital_Status')
In [456]:
perc_on_bar('Education')
In [457]:
perc_on_bar('Kidhome')
In [458]:
perc_on_bar('Teenhome')

Bivariate Analysis¶

We have analyzed different categorical and numerical variables. Now, let's check how different variables are related to each other.

Correlation Heat map¶

Heat map can show a 2D correlation matrix between numerical features.

In [459]:
plt.figure(figsize=(17, 10))                                                        
sns.heatmap(data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap='rocket', linewidths=3) 
Out[459]:
<AxesSubplot:>

Observations and Insights:¶

The insights captured from this heat map are very interesting. I chose a threshold of 0.60 as a moderate correlation to consider as worthy of further analysis. Based on this, we can observe the following moderate positive correlation:

  • Catalog purchases and amount spent on wine products: 0.64
  • Store purchases and amount spent on wine products: 0.64
  • Catalog purchases and amount spent on meat products: 0.72

These moderate positive correlations may be considered for further exploration and analysis in the next steps of this project.
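
The 0.60 threshold can also be applied programmatically, extracting every feature pair above it from the correlation matrix. The frame below is a toy stand-in; on the real data, `toy.corr()` would simply be `data.corr()`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the numeric columns of `data`
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],   # perfectly correlated with a
    "c": [5, 3, 4, 1, 2],
})

corr = toy.corr()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep upper triangle only
pairs = upper.stack()                                              # (feature, feature) -> r
pairs = pairs[pairs.abs() >= 0.60]                                 # apply the 0.60 threshold
print(pairs)
```

Masking the upper triangle avoids reporting each pair twice (and the trivial diagonal of 1.0s).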

The above correlation heatmap only shows the relationship between numerical variables. Let's check the relationship of numerical variables with categorical variables.

Education Vs Income¶

In [460]:
sns.barplot(x="Education", y="Income", data=data)
plt.show()

Marital Status Vs Income¶

In [461]:
sns.barplot(x="Marital_Status", y="Income", data=data)
plt.show()

Kidhome Vs Income¶

In [462]:
sns.barplot(x="Kidhome", y="Income", data=data)
plt.show()

We can also visualize the relationship between two categorical variables.

Marital_Status Vs Kidhome¶

In [463]:
pd.crosstab(data.Marital_Status, data.Kidhome).plot(kind="bar",stacked=False)
Out[463]:
<AxesSubplot:xlabel='Marital_Status'>

Feature Engineering and Data Processing¶

In this section, we will first prepare our dataset for analysis.

  • Creating new columns
  • Imputing missing values

Think About It:

  • The Year_Birth column in its current format might not be very useful in our analysis; it contains only the customer's year of birth. Can we use it to compute the age of each customer?
  • Are there other columns which can be used to create new features?

Age¶

In [464]:
# Year_Birth already holds the year of birth as an integer, so subtracting it
# from 2016 gives us the age of the customer at the time of data collection in 2016

data["Age"] = 2016 - data["Year_Birth"]

# Sorting the values in ascending order 
data["Age"].sort_values()                                         
Out[464]:
1170     20
46       20
696      21
747      21
1850     21
       ... 
424      75
1950     76
192     116
339     117
239     123
Name: Age, Length: 2232, dtype: int64
In [465]:
# Finding the outliers: any age above 115 (dropped in the next cell)
data[data["Age"] > 115]
Out[465]:
Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits ... NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age
192 1900 Master Divorced 36640.0 1 0 26-09-2013 99 15 6 ... 2 5 0 0 0 0 0 1 0 116
239 1893 Master Single 60182.0 0 1 17-05-2014 23 8 0 ... 2 4 0 0 0 0 0 0 0 123
339 1899 PhD Together 83532.0 0 0 26-09-2013 36 755 144 ... 4 1 0 0 1 0 0 0 0 117

3 rows × 27 columns

In [466]:
data.drop([192, 239, 339], axis=0, inplace=True)
In [467]:
data.shape
Out[467]:
(2229, 27)

Now, let's check the distribution of age in the data.

In [468]:
plt.figure(figsize=(15, 7))
sns.histplot(x="Age", data=data)
plt.show()

Kids¶

  • Let's create feature "Kids" indicating the total kids and teens in the home.
In [469]:
# Adding Kidhome and Teenhome variables to create the new feature called "Kids"
data["Kids"] = data["Kidhome"] + data["Teenhome"]

Family Size¶

  • Let's create a new variable called 'Family Size' to find out how many members each family has.
  • For this, we need to have a look at the Marital_Status variable, and see what are the categories.
In [470]:
# Checking the unique categories in Marital_Status
data["Marital_Status"].unique()
Out[470]:
array(['Single', 'Together', 'Married', 'Divorced', 'Widow', 'Others'],
      dtype=object)
In [471]:
data["Marital_Status"].value_counts()
Out[471]:
Married     861
Together    575
Single      482
Divorced    230
Widow        77
Others        4
Name: Marital_Status, dtype: int64
  • We can combine the sub-categories Single, Divorced, Widow as "Single" and we can combine the sub-categories Married and Together as "Relationship"
  • Then we can create a new variable called "Status" and assign values 1 and 2 to categories Single and Relationship, respectively.
  • Then, we can use the Kids (calculated above) and the Status column to find the family size.
In [472]:
# Replacing "Married" and "Together" with "Relationship"
data["Marital_Status"].replace(["Married", "Together"], "Relationship", inplace = True)
In [473]:
# Replacing "Divorced" and "Widow" with "Single"
data["Marital_Status"].replace(["Divorced", "Widow"], "Single", inplace = True)
In [474]:
data["Marital_Status"].replace({"Single": 1, "Others": 1, "Relationship": 2}, inplace = True) 
In [475]:
data["Marital_Status"].value_counts()
Out[475]:
2    1436
1     793
Name: Marital_Status, dtype: int64
In [476]:
# Adding two variables Status and Kids to get the total number of persons in each family
data["Family_Size"] = data["Marital_Status"] + data["Kids"]

Expenses¶

  • Let's create a new feature called "Expenses", indicating the total amount spent by the customers in various products over the span of two years.
In [477]:
data["Expenses"] = data["MntWines"] + data["MntFruits"] + data["MntMeatProducts"] + data["MntFishProducts"] + data["MntSweetProducts"] + data["MntGoldProds"]

Total Purchases¶

  • Let's create a new feature called "NumTotalPurchases", indicating the total number of products purchased by the customers.
In [478]:
data["NumTotalPurchases"] = data["NumDealsPurchases"] + data["NumWebPurchases"] + data["NumCatalogPurchases"] + data["NumStorePurchases"]

Engaged in Days¶

  • Let's create a new feature called "Engaged in days", indicating how long the customer has been with the company.
In [479]:
# Converting Dt_customer variable to Python date time object
data["Dt_Customer"] = pd.to_datetime(data["Dt_Customer"]) 

# Sorting the values in ascending order 
data["Dt_Customer"].sort_values()        
Out[479]:
2029   2012-01-08
724    2012-01-08
976    2012-01-08
2194   2012-01-08
1473   2012-01-09
          ...    
1952   2014-12-05
216    2014-12-05
288    2014-12-05
153    2014-12-05
2003   2014-12-06
Name: Dt_Customer, Length: 2229, dtype: datetime64[ns]

Let's check the max and min of the date.

In [480]:
# Checking the minimum of the date

min(data.Dt_Customer)
Out[480]:
Timestamp('2012-01-08 00:00:00')
In [481]:
# Checking the maximum of the date

max(data.Dt_Customer)
Out[481]:
Timestamp('2014-12-06 00:00:00')

Think About It:

  • From the above output from the max function, we observed that the last customer enrollment date is December 6th, 2014. Can we extract the number of days a customer has been with the company using some date as the threshold? Can January 1st, 2015 be that threshold?
In [482]:
 # Assigning date to the day variable
data["day"] = "01-01-2015"                         

# Converting the variable day to Python datetime object
data["day"] = pd.to_datetime(data.day)              
In [483]:
data["Engaged_in_days"] = (data["day"] - data["Dt_Customer"]).dt.days     

TotalAcceptedCmp¶

  • Let's create a new feature called "TotalAcceptedCmp" that shows how many offers customers have accepted.
In [484]:
data["TotalAcceptedCmp"] = data["AcceptedCmp1"] + data["AcceptedCmp2"] + data["AcceptedCmp3"] + data["AcceptedCmp4"] + data["AcceptedCmp5"]

AmountPerPurchase¶

  • Let's create a new feature called "AmountPerPurchase" indicating the amount spent per purchase.
In [485]:
data['AmountPerPurchase'] = data['Expenses'] / data['NumTotalPurchases']

Now, let's check the maximum value of the AmountPerPurchase.

In [486]:
max(data.AmountPerPurchase)
Out[486]:
inf
In [487]:
# Finding how many observations have NumTotalPurchases equal to 0

data[data["NumTotalPurchases"] == 0]
Out[487]:
Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits ... Response Age Kids Family_Size Expenses NumTotalPurchases day Engaged_in_days TotalAcceptedCmp AmountPerPurchase
981 1965 Graduation 1 4861.0 0 0 2014-06-22 20 2 1 ... 0 51 0 1 6 0 2015-01-01 193 0 inf
1524 1973 Graduation 1 3502.0 1 0 2013-04-13 56 2 1 ... 0 43 1 2 5 0 2015-01-01 628 0 inf

2 rows × 35 columns

In [488]:
# Dropping the observations with NumTotalPurchases equal to 0

data.drop([981, 1524], axis=0, inplace=True)
In [489]:
max(data.AmountPerPurchase)
Out[489]:
1679.0
In [490]:
min(data.AmountPerPurchase)
Out[490]:
0.5333333333333333
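An alternative to dropping the two zero-purchase rows would be to guard the division itself, so no inf values are produced in the first place. A minimal sketch on a toy frame (the column names mirror the ones above; the values are made up):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Expenses": [1617, 6, 27], "NumTotalPurchases": [25, 0, 6]})

# Map a zero denominator to NaN before dividing, so the ratio becomes NaN instead of inf
toy["AmountPerPurchase"] = toy["Expenses"] / toy["NumTotalPurchases"].replace(0, np.nan)
print(toy["AmountPerPurchase"].tolist())  # [64.68, nan, 4.5]
```

NaN rows can then be dropped or imputed later, with the rest of the missing-value handling.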

Now, let's check the distribution of values in AmountPerPurchase column.

In [491]:
data["AmountPerPurchase"].describe().T
Out[491]:
count    2227.000000
mean       33.274270
std        45.040897
min         0.533333
25%         9.714286
50%        23.352941
75%        45.281773
max      1679.000000
Name: AmountPerPurchase, dtype: float64
In [492]:
plt.figure(figsize=(10, 7))
sns.histplot(x="AmountPerPurchase", data=data)
plt.show()

Imputing Missing Values for Income¶

In [493]:
# Imputing the missing values for the Income variable with the median

data["Income"].fillna(data.Income.median(), inplace = True)

Now that we are done with data preprocessing, let's visualize new features against the new income variable we have after imputing missing values.

Income Vs Expenses¶

In [494]:
# Plot the scatter plot with Expenses on Y-axis and Income on X-axis  

plt.figure(figsize=(10, 7))                                    # Setting the plot size

sns.scatterplot(data = data, x = "Income", y = "Expenses")      # Hint: Use sns.scatterplot()  

plt.xticks(fontsize=16)                                         # Font size of X-label

plt.yticks(fontsize=16)                                         # Font size of Y-label

plt.xlabel("Income", fontsize=20, labelpad=20)                  # Title of X-axis

plt.ylabel("Expenses", fontsize=20, labelpad=20)                # Title of Y-axis
Out[494]:
Text(0, 0.5, 'Expenses')

Family Size Vs Income¶

In [495]:
plt.figure(figsize=(10, 7))
sns.barplot(x="Family_Size", y="Income", data=data)
plt.show()

Proposed approach¶

Considering the data exploration and preparation already done, my proposed approach would be to follow the next steps:

  1. Select the columns that represent behavioral customer data, which will be used for segmentation, and set aside the columns containing demographic information about customers, which will be used only for profiling.
  2. Scale the data so that all variables share a standard scale. This is especially important because I will apply distance-based clustering algorithms.
  3. Apply dimensionality reduction techniques: Principal Component Analysis and T-SNE.
  4. Apply the selected clustering techniques to find segments in the customer database. The models I will apply in this project are K-Means, K-Medoids, Hierarchical Clustering, DBSCAN, and Gaussian Mixture Model.
  5. From all these solutions, I will select at least three clear, distinctive clusters to represent three customer segments. These clusters will be profiled and presented to the marketing team as a suggested solution.
In [496]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2227 entries, 0 to 2239
Data columns (total 35 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Year_Birth           2227 non-null   int64         
 1   Education            2227 non-null   object        
 2   Marital_Status       2227 non-null   int64         
 3   Income               2227 non-null   float64       
 4   Kidhome              2227 non-null   int64         
 5   Teenhome             2227 non-null   int64         
 6   Dt_Customer          2227 non-null   datetime64[ns]
 7   Recency              2227 non-null   int64         
 8   MntWines             2227 non-null   int64         
 9   MntFruits            2227 non-null   int64         
 10  MntMeatProducts      2227 non-null   int64         
 11  MntFishProducts      2227 non-null   int64         
 12  MntSweetProducts     2227 non-null   int64         
 13  MntGoldProds         2227 non-null   int64         
 14  NumDealsPurchases    2227 non-null   int64         
 15  NumWebPurchases      2227 non-null   int64         
 16  NumCatalogPurchases  2227 non-null   int64         
 17  NumStorePurchases    2227 non-null   int64         
 18  NumWebVisitsMonth    2227 non-null   int64         
 19  AcceptedCmp3         2227 non-null   int64         
 20  AcceptedCmp4         2227 non-null   int64         
 21  AcceptedCmp5         2227 non-null   int64         
 22  AcceptedCmp1         2227 non-null   int64         
 23  AcceptedCmp2         2227 non-null   int64         
 24  Complain             2227 non-null   int64         
 25  Response             2227 non-null   int64         
 26  Age                  2227 non-null   int64         
 27  Kids                 2227 non-null   int64         
 28  Family_Size          2227 non-null   int64         
 29  Expenses             2227 non-null   int64         
 30  NumTotalPurchases    2227 non-null   int64         
 31  day                  2227 non-null   datetime64[ns]
 32  Engaged_in_days      2227 non-null   int64         
 33  TotalAcceptedCmp     2227 non-null   int64         
 34  AmountPerPurchase    2227 non-null   float64       
dtypes: datetime64[ns](2), float64(2), int64(30), object(1)
memory usage: 626.3+ KB

Milestone 2¶

Preparing Data for Segmentation¶

Dropping columns that we will not use for segmentation¶

The decision about which variables to use for clustering is a critically important decision that will have a big impact on the clustering solution. So we need to think carefully about the variables we will choose for clustering. Clearly, this is a step where a lot of contextual knowledge, creativity, and experimentation/iterations are needed.

Moreover, we often use only a few of the data attributes for segmentation (the segmentation attributes) and use some of the remaining ones (the profiling attributes) only to profile the clusters. For example, in market research and market segmentation, we can use behavioral data for segmentation (to segment the customers based on their behavior like amount spent, units bought, etc.), and then use both demographic as well as behavioral data for profiling the segments found.
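As a sketch of this split into roles (these column lists are illustrative, not the final selection made below):

```python
# Hypothetical role assignment: behavioral columns drive the clustering,
# demographic columns are used only to profile the resulting segments
segmentation_cols = ["Recency", "Expenses", "NumTotalPurchases", "TotalAcceptedCmp"]
profiling_cols = ["Income", "Age", "Family_Size", "Education"]

# A column should play only one role at a time
overlap = set(segmentation_cols) & set(profiling_cols)
print(overlap)  # set()
```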

Here, we will use the behavioral attributes for segmentation and drop the demographic attributes like Income, Age, and Family_Size. In addition to this, we need to drop some other columns which are mentioned below.

  • Dt_Customer: We have created the Engaged_in_days variable using the Dt_Customer variable. Hence, we can drop this variable as it will not help with segmentation.
  • Complain: About 95% of the customers didn't complain and have the same value for this column. This variable will not have a major impact on segmentation. Hence, we can drop this variable.
  • day: We have created the Engaged_in_days variable using the 'day' variable. Hence, we can drop this variable as it will not help with segmentation.
  • Status: The recoding of Marital_Status was done just to derive the Family_Size variable, which already contains that information. Hence, we can drop this variable.
  • We also need to drop categorical variables like Education, Marital_Status, Kids, Kidhome, and Teenhome, as distance-based algorithms cannot use a default distance like Euclidean to measure distances involving both categorical and numerical variables.
  • We can also drop the categorical variables AcceptedCmp1, AcceptedCmp2, AcceptedCmp3, AcceptedCmp4, AcceptedCmp5, and Response, for which we have created the variable TotalAcceptedCmp, the aggregate of all of them.
In [497]:
# Dropping all the irrelevant columns and storing in data_model
data_model = data.drop(
    columns=[
        "Year_Birth",
        "Dt_Customer",
        "day",
        "Complain",
        "Response",
        "AcceptedCmp1",
        "AcceptedCmp2",
        "AcceptedCmp3",
        "AcceptedCmp4",
        "AcceptedCmp5",
        "Marital_Status",
        "Kids",
        'Education',
        'Kidhome',
        'Teenhome', 'Income','Age', 'Family_Size'
    ],
    axis=1,
)
In [498]:
data_model.shape
Out[498]:
(2227, 17)
In [499]:
# Check first five rows of new data
data_model.head(5)
Out[499]:
Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth Expenses NumTotalPurchases Engaged_in_days TotalAcceptedCmp AmountPerPurchase
0 58 635 88 546 172 88 88 3 8 10 4 7 1617 25 997 0 64.680000
1 38 11 1 6 2 1 6 2 1 1 2 5 27 6 151 0 4.500000
2 26 426 49 127 111 21 42 1 8 2 10 4 776 21 498 0 36.952381
3 26 11 4 20 10 3 5 2 2 0 4 6 53 8 91 0 6.625000
4 94 173 43 118 46 27 15 5 5 3 6 5 422 19 347 0 22.210526

Let's plot the correlation plot after we've removed the irrelevant variables.

In [500]:
# Plot the correlation plot for new data
plt.figure(figsize=(15, 7))
sns.heatmap(data_model.corr(), annot= True, vmin=-1, vmax=1,fmt=" .2f", cmap='rocket', linewidths=3)
plt.show()

Observations and insights: I chose a threshold of 0.60 as the minimum to consider that a moderately positive correlation exists between variables. These are the correlations above that threshold shown in the heat map:

1) Expenses with:

  • MntWines: 0.90 (**)
  • MntFruits: 0.62
  • MntMeatProducts: 0.85
  • MntFishProducts: 0.64
  • MntSweetProducts: 0.60
  • NumCatalogPurchases: 0.80
  • NumStorePurchases: 0.68
  • NumTotalPurchases: 0.75
  • AmountPerPurchase: 0.62

2) NumTotalPurchases with:

  • MntWines: 0.72
  • NumWebPurchases: 0.79
  • NumCatalogPurchases: 0.74
  • NumStorePurchases: 0.83

3) MntWines with:

  • Expenses: 0.90 (**)
  • NumCatalogPurchases: 0.67
  • NumStorePurchases: 0.64

4) MntMeatProducts with:

  • NumCatalogPurchases: 0.70
  • AmountPerPurchase: 0.62

Scaling the Data¶

What is feature scaling?

Feature scaling is a class of statistical techniques that, as the name implies, scales the features of our data so that they all have a similar range. An example makes this clearer:

If you have multiple independent variables like Age, Income, and Amount related variables, with their range as (18–100 Years), (25K–75K), and (100–200), respectively, feature scaling would help them all to be in the same range.

Why is feature scaling important in unsupervised learning?

Feature scaling is especially relevant in machine learning models that compute some sort of distance metric as we do in most clustering algorithms, for example, K-Means.

So, scaling should be done to avoid the problem of one feature dominating over others because the unsupervised learning algorithm uses distance to find the similarity between data points.

Let's scale the data

Standard Scaler: StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance. Unit variance means dividing all the values by the standard deviation.

  1. Data standardization is the process of rescaling the attributes so that they have a mean of 0 and a variance of 1.
  2. The ultimate goal to perform standardization is to bring down all the features to a common scale without distorting the differences in the range of the values.
  3. In sklearn.preprocessing.StandardScaler(), centering and scaling happen independently on each feature.
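A small sketch verifying the points above on made-up Age/Income-like values: StandardScaler's output matches the per-column formula z = (x − mean) / std.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[18.0, 25000.0], [45.0, 50000.0], [100.0, 75000.0]])

scaled = StandardScaler().fit_transform(X)

# StandardScaler standardizes each column independently, using the population std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(scaled, manual))  # True
```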
In [501]:
# Applying standard scaler on new data
scaler = StandardScaler()                            # Initialize the Standard Scaler

data_scaled = scaler.fit_transform(data_model)       # fit_transform the scaler on the new data

data_scaled = pd.DataFrame(data_scaled, columns=data_model.columns)  # Converting the embeddings to a dataframe

data_scaled.head(10)
Out[501]:
Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth Expenses NumTotalPurchases Engaged_in_days TotalAcceptedCmp AmountPerPurchase
0 0.306906 0.979274 1.549793 1.735137 2.455586 1.471064 0.841828 0.357919 1.404892 2.633008 -0.561330 0.696875 1.682349 1.329371 1.975678 -0.440703 0.697428
1 -0.384035 -0.873925 -0.638021 -0.726901 -0.652191 -0.633425 -0.732464 -0.169952 -1.119121 -0.586255 -1.178960 -0.135935 -0.963691 -1.167064 -1.667464 -0.440703 -0.638991
2 -0.798600 0.358572 0.569049 -0.175222 1.340442 -0.149634 -0.041311 -0.697824 1.404892 -0.228559 1.291559 -0.552339 0.282777 0.803806 -0.173173 -0.440703 0.081680
3 -0.798600 -0.873925 -0.562579 -0.663070 -0.505942 -0.585045 -0.751662 -0.169952 -0.758548 -0.943951 -0.561330 0.280470 -0.920422 -0.904281 -1.925843 -0.440703 -0.591801
4 1.550599 -0.392806 0.418165 -0.216256 0.152175 -0.004497 -0.559676 1.413662 0.323172 0.129137 0.056299 -0.135935 -0.306341 0.541023 -0.823427 -0.440703 -0.245693
5 -1.144070 0.637739 0.393018 -0.307442 -0.688753 0.358346 -0.578874 -0.169952 0.683745 0.486833 1.291559 0.280470 0.182926 0.935197 -0.254993 -0.440703 -0.016185
6 -0.522224 -0.208674 0.971406 -0.006527 0.225299 0.527673 -0.329291 0.885790 1.044319 0.129137 0.365114 0.280470 -0.026760 0.803806 1.036902 -0.440703 -0.115011
7 -0.591318 -0.680883 -0.411695 -0.498934 -0.633910 -0.633425 -0.406086 -0.169952 -0.037401 -0.943951 -0.561330 1.113279 -0.727378 -0.641499 -0.104272 -0.440703 -0.363624
8 -1.040429 -0.865015 -0.663168 -0.644833 -0.633910 -0.585045 -0.809258 -0.697824 -0.397975 -0.943951 -1.178960 1.529684 -0.932072 -1.167064 0.154107 -0.440703 -0.568669
9 0.652376 -0.823437 -0.663168 -0.726901 -0.670472 -0.633425 -0.598073 -0.697824 -1.119121 -0.943951 -1.796590 6.110134 -0.927079 -1.692629 -1.051661 1.035154 -0.194850

Applying T-SNE and PCA to the data to visualize the data distributed in 2 dimensions¶

Applying T-SNE¶

In [502]:
# Fitting T-SNE with number of components equal to 2 to visualize how data is distributed

tsne = TSNE(n_components=2, random_state=1, perplexity=35) # Initializing T-SNE with number of component equal to 2, random_state=1, and perplexity=35

data_customer_tsne = tsne.fit_transform(data_model)  # fit_transform T-SNE on new data

data_customer_tsne = pd.DataFrame(data_customer_tsne, columns=[0, 1])  # Converting the embeddings to a dataframe

plt.figure(figsize=(7, 7))               # Scatter plot for two components

sns.scatterplot(x=0, y=1, data=data_customer_tsne)  # Plotting T-SNE
Out[502]:
<AxesSubplot:xlabel='0', ylabel='1'>
In [503]:
# Fitting T-SNE with number of components equal to 2 to visualize how data is distributed

tsne = TSNE(n_components=2, random_state=1, perplexity=15) # Initializing T-SNE with number of components equal to 2, random_state=1, and perplexity=15

data_customer_tsne = tsne.fit_transform(data_model)  # fit_transform T-SNE on new data

data_customer_tsne = pd.DataFrame(data_customer_tsne, columns=[0, 1])  # Converting the embeddings to a dataframe

plt.figure(figsize=(7, 7))               # Scatter plot for two components

sns.scatterplot(x=0, y=1, data=data_customer_tsne)  # Plotting T-SNE
Out[503]:
<AxesSubplot:xlabel='0', ylabel='1'>
In [504]:
# Fitting T-SNE with number of components equal to 2 to visualize how data is distributed

tsne = TSNE(n_components=2, random_state=1, perplexity=50) # Initializing T-SNE with number of components equal to 2, random_state=1, and perplexity=50

data_customer_tsne = tsne.fit_transform(data_model)  # fit_transform T-SNE on new data

data_customer_tsne = pd.DataFrame(data_customer_tsne, columns=[0, 1])  # Converting the embeddings to a dataframe

plt.figure(figsize=(7, 7))               # Scatter plot for two components

sns.scatterplot(x=0, y=1, data=data_customer_tsne)  # Plotting T-SNE
Out[504]:
<AxesSubplot:xlabel='0', ylabel='1'>

Observations and Insights: I tried three values for perplexity: 35, 15, and 50. From these scatterplots, it is not possible to observe any meaningful clusters that would give an idea of how the data is distributed. So far, it is not possible to identify underlying cluster patterns in the data.

Applying PCA¶

When the variables used in clustering are highly correlated, it causes multicollinearity, which affects the clustering method and results in poor cluster profiling (or biased toward a few variables). PCA can be used to reduce the multicollinearity between the variables.
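A minimal sketch of this effect on synthetic data (two deliberately correlated features; the values are made up): the principal component scores come out uncorrelated.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
base = rng.normal(size=500)
X = np.column_stack([base, base + 0.1 * rng.normal(size=500)])  # two highly correlated features

pcs = PCA(n_components=2).fit_transform(X)

print(np.corrcoef(X.T)[0, 1])    # close to 1: strong multicollinearity
print(np.corrcoef(pcs.T)[0, 1])  # close to 0: the components are uncorrelated
```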

In [505]:
# Defining the number of principal components to generate
n = data_model.shape[1]                                   # Storing the number of variables in the data

pca = PCA(n_components=n, random_state = 1)          # Initialize PCA with n_components = n and random_state=1

data_pca = pd.DataFrame(pca.fit_transform(data_scaled))      # fit_transform PCA on the scaled data

# The percentage of variance explained by each principal component is stored
exp_var = pca.explained_variance_ratio_    
                     

Let's plot the first two components and see how the data points are distributed.

In [506]:
# Scatter plot for two components using the dataframe data_pca
plt.figure(figsize = (7, 7))
sns.scatterplot(x = data_pca[0], y = data_pca[1])
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()

Let's apply clustering algorithms on the data generated after applying PCA

K-Means¶

In [507]:
distortions = []                                                  # Create an empty list

K = range(2, 10)                                                  # Setting the K range from 2 to 10

for k in K:
    kmeanModel = KMeans(n_clusters=k,random_state=4)              # Initialize K-Means
    kmeanModel.fit(data_pca)                                      # Fit K-Means on the data
    distortions.append(kmeanModel.inertia_)                       # Append distortion values to the empty list created above
In [508]:
# Plotting the elbow plot
plt.figure(figsize=(16, 8))                                            # Setting the plot size

plt.plot(K, distortions, "bx-")                                        # Plotting the K on X-axis and distortions on y-axis

plt.xlabel("k")                                                        # Title of x-axis

plt.ylabel("Distortion")                                               # Title of y-axis

plt.title("The Elbow Method showing the optimal k")                    # Title of the plot
plt.show()

Observations: The curve shows elbow-like drops at k = 3, 4, and 5.

We can use the silhouette score as a metric for different K values to make a better decision about picking the number of clusters(K).

What is the silhouette score?¶

Silhouette score is one of the methods for evaluating the quality of clusters created using clustering algorithms such as K-Means. The silhouette score is a measure of how similar an object is to its cluster (cohesion) compared to other clusters (separation). Silhouette score has a range of [-1, 1].

  • Silhouette coefficients near +1 indicate that the clusters are dense and well separated, which is good.
  • Silhouette score near -1 indicates that those samples might have been assigned to the wrong cluster.
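As a quick sanity check of this metric on synthetic data (using make_blobs from scikit-learn; the parameters are made up): tight, well-separated blobs score close to +1.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three tight, well-separated blobs
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=1)
labels = KMeans(n_clusters=3, random_state=1, n_init=10).fit_predict(X)

print(silhouette_score(X, labels))  # high (well-separated clusters)
```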

Finding silhouette score for each value of K

In [509]:
sil_score = []                                                             # Creating empty list
cluster_list = range(3, 8)                                                 # Creating a range from 3 to 7
for n_clusters in cluster_list:
    
    # Initialize K-Means with number of clusters equal to n_clusters and random_state=1
    clusterer = KMeans(n_clusters = n_clusters, random_state=1)
    
    # Fit and predict on the pca data
    
    preds = clusterer.fit(data_pca).predict(data_pca)
    
    # Calculate silhouette score - Hint: Use silhouette_score() function
    score = silhouette_score(data_pca, preds)  
    
    # Append silhouette score to empty list created above
    sil_score.append(score)
            
    # Print the silhouette score
    print("For n_clusters = {}, the silhouette score is {}".format(n_clusters, score))
For n_clusters = 3, the silhouette score is 0.2700522050350727
For n_clusters = 4, the silhouette score is 0.2576362652878985
For n_clusters = 5, the silhouette score is 0.22583613398899802
For n_clusters = 6, the silhouette score is 0.24314705614051696
For n_clusters = 7, the silhouette score is 0.12983168949386734

Observations and Insights:

  • About the Elbow Method graph: We can observe one clear, large drop in the curve at k = 3, and slight, almost imperceptible ones at k = 4 and k = 7. Before choosing which K to use, let's see what the silhouette score is for each candidate K.
  • About the silhouette score: I computed the silhouette score for K in the range 3 to 7. The values of K with the highest scores are 3, 4, and 5, so I will try all three of them.

From the above silhouette scores, 3 appears to be a good value of K. So, let's build K-Means using K=3.

Applying K-Means on data_pca¶

Let's build K-Means using K=3 first

In [510]:
kmeans = KMeans(n_clusters= 3, random_state= 1) # Initialize the K-Means algorithm with 3 clusters and random_state=1

kmeans.fit(data_pca)                                # Fitting on the data_pca
Out[510]:
KMeans(n_clusters=3, random_state=1)
In [511]:
data_pca["K_means_segments_3"] = kmeans.labels_                    # Adding K-Means cluster labels to the data_pca data

data["K_means_segments_3"] = kmeans.labels_                        # Adding K-Means cluster labels to the whole data

data_model["K_means_segments_3"] = kmeans.labels_                  # Adding K-Means cluster labels to data_model
In [512]:
# Let's check the distribution
data_model["K_means_segments_3"].value_counts()
Out[512]:
1    1046
0     613
2     568
Name: K_means_segments_3, dtype: int64

Visualizing these clusters as a histogram:

In [513]:
plt.figure(figsize=(15,7))
sns.histplot(x='K_means_segments_3', data=data_model)
plt.show()

Let's visualize these clusters using PCA

In [514]:
# Function to visualize PCA data with clusters formed
def PCA_PLOT(X, Y, PCA, cluster):
    sns.scatterplot(x=X, y=Y, data=PCA, hue=cluster)
In [515]:
PCA_PLOT(0, 1, data_pca, "K_means_segments_3")

Observations and Insights:

  • About the distribution of observations: K=3 yields three clusters with a reasonably balanced (though not equal) distribution, so this is a solution I would like to keep in mind. Let's see later whether this is supported by the profiling results.
  • About the scatterplot: Although these clusters do not have strong boundaries and overlap where they meet, three fairly defined clusters are visible.
  • Cluster 0 appears to have the lowest density of all, followed by cluster 2, whereas cluster 1 has the most observations and looks the tightest of the three.
  • Clusters 0 and 2 have a similar shape, while cluster 1 looks smaller and less spread out than the other two.
  • Most data points follow a similar pattern, except for a few that lie far away from the clusters.

Cluster Profiling for K=3¶

In [516]:
# Taking the cluster-wise mean of all the variables. Hint: First groupby 'data' by 'K_means_segments_3' and then find mean
cluster_profile_KMeans_3 = data.groupby('K_means_segments_3').mean()
In [517]:
# Highlighting the maximum average value among all the clusters for each of the variables
cluster_profile_KMeans_3.style.highlight_max(color="lightgreen", axis=0)
Out[517]:
  Year_Birth Marital_Status Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age Kids Family_Size Expenses NumTotalPurchases Engaged_in_days TotalAcceptedCmp AmountPerPurchase
K_means_segments_3                                                                
0 1965.972268 1.654160 57357.499184 0.288744 0.831974 47.712887 443.132137 22.879282 135.845024 30.053834 24.394780 61.404568 3.773246 6.464927 3.060359 7.843393 5.771615 0.073409 0.119086 0.019576 0.035889 0.014682 0.008157 0.133768 50.027732 1.120718 2.774878 717.709625 21.141925 593.830343 0.262643 33.698297
1 1971.021033 1.652008 35170.983748 0.756214 0.474187 49.165392 43.802103 4.975143 23.100382 7.272467 5.079350 14.972275 2.009560 2.113767 0.564054 3.268642 6.377629 0.066922 0.015296 0.000000 0.000956 0.001912 0.010516 0.085086 44.978967 1.230402 2.882409 99.201721 7.956023 499.522945 0.085086 11.196880
2 1968.109155 1.621479 75881.901408 0.038732 0.220070 50.540493 637.966549 69.542254 459.471831 101.890845 70.908451 79.267606 1.330986 5.220070 6.005282 8.325704 2.910211 0.084507 0.137324 0.264085 0.213028 0.031690 0.007042 0.286972 47.890845 0.258803 1.880282 1419.047535 20.882042 549.441901 0.730634 73.473251

Observations and Insights on each cluster:

  • Cluster 0 has 613 observations. It corresponds to a segment of customers that have the highest mean in columns Marital_Status, Teenhome, NumDealsPurchases, NumWebPurchases, Age, NumTotalPurchases, Engaged_in_days.

  • Cluster 1 has 1046 observations. It corresponds to a segment of customers that has the highest mean in columns Year_Birth, Kidhome, NumWebVisitsMonth, Complain, Kids, and Family_Size.

  • Cluster 2 has 568 observations. It has the highest mean in columns Income, Recency, all product columns, NumCatalogPurchases, NumStorePurchases, all AcceptedCmp columns, Response, Expenses, TotalAcceptedCmp, and AmountPerPurchase.

Let us create a boxplot for each of the variables

In [518]:
# Columns to use in boxplot
col_for_box = ['Income', 'Kidhome', 'Teenhome', 'Recency','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts','MntGoldProds','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases','NumWebVisitsMonth','Complain','Age', 'Family_Size','Expenses','NumTotalPurchases','Engaged_in_days','TotalAcceptedCmp','AmountPerPurchase']
In [519]:
# Creating boxplot for each of the variables
all_col = col_for_box

plt.figure(figsize = (30, 50))

for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    
    sns.boxplot(y=data[variable], x=data['K_means_segments_3'],showmeans=True)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()

Characteristics of each cluster:¶

Cluster 0:

  • They are medium income customers
  • They live with a kid at home
  • They have mostly the same recency as cluster 1, with 75% of purchases made between 25 and 70 days ago.
  • They spent the most on wine, followed by meat, gold, fish, sweet, and fruit, with several outliers in all of them (on the upper end).
  • They are the segment that purchases the most deals, with a median of 3.
  • They purchase mainly in store, followed by website purchases.
  • They also make a few purchases per catalog, but it is the channel they use the least.
  • They frequently visit the website, with a median of 6 visits per month.
  • They don't have any complaints so far.
  • 50% of them are between 42 and 59 years old, with a median of 50 years old.
  • 50% have spent between 500 and 900 on products at the store.
  • They are the earliest clients, having been engaged with the store for a median of 600 days.
  • They haven’t accepted any campaign.

Summary for cluster 0: They are medium-income customers living with a kid at home. They spent the most on wines, and they mainly purchase at the store. They are the segment that purchases deals the most, and they also visit the website frequently. The catalog is the channel they use the least. They are the earliest clients, having been engaged with the store for a median of 600 days. Their median age is 50 years old.

Cluster 1 profile:

  • They are the segment with the lowest income of all customers, with a median of approximately 35,000.
  • They live with kids or teenagers at home.
  • The majority of them have a recency (number of days since last purchase) between 25 days and 75 days.
  • They are the segment that spent the least on wine, fruits, meat, fish, sweets, and gold products. Among these categories, they spent the most on wine, followed by gold. They have plenty of upper-end outliers in all of these categories, though.
  • They buy some deals, with a median of 2 purchases.
  • They are the segment that uses the website the least to make purchases, but at the same time they are the ones that visit the website the most per month.
  • They barely use catalogs to purchase.
  • The channel they use most is in store, followed by web purchases and deal purchases.
  • They haven't made any complaints.
  • 50% of them are between 38 and 53 years old, with a median of 45 years old; this segment includes the youngest clients, at 20 years old.
  • They have spent the least amount on products overall, although there are several outliers on the upper side.
  • They have made the fewest total purchases of all segments, with 75% of them making fewer than 10 purchases.
  • They include the newest customers, with 50% of them enrolled between approximately 300 and almost 700 days ago.
  • They have not accepted any campaign.
  • The amount per purchase is the lowest of all segments, with some outliers on the upper end.

Summary for cluster 1: This cluster is made up of low-income customers who live with a kid at home. They are between 38 and 53 years old. They spent the least overall, but when they do buy, they purchase deals and mostly in store, mainly wine. They are the ones that visit the website the most. They haven't accepted any campaign.

Cluster 2 profile:

  • High-income customers, with no children living at home with them.
  • They have the highest median recency (number of days since the last purchase), at 55 days.
  • Largest amount spent on wine, fruits, meat, fish, sweet, and gold products during the last 2 years.
  • They do not purchase deals.
  • They use the web and catalog channels similarly, but their main channel is in-store purchases.
  • 50% of them are between 49 and 59 years old, with a median of 48 years old.
  • They have the smallest family size, probably because they do not have children at home living with them.
  • They are the ones that spend the largest amount in the store, with a median of 21 total purchases.
  • They have been clients of the store mostly between 390 and more than 700 days (with a median of approximately 580 days).
  • And it is the only segment that has accepted a campaign.
  • The average amount per purchase of this segment is the largest.
  • They have not complained.

Summary for cluster 2: This cluster is made up of high-income customers, with no children living at home with them. They have spent the largest amounts in all products, during the last 2 years. The main channel they use is in store purchases, and they do not buy deals. They are between 49 and 59 years old. They have accepted a campaign. They have been clients for more than a year.

In [520]:
# Dropping labels we got from K=3 since we will be using PCA data for prediction
# Drop K_means_segments_3. Hint: Use axis=1 and inplace=True
data_pca.drop(['K_means_segments_3'], axis=1, inplace=True)
data.drop(['K_means_segments_3'], axis=1, inplace=True)

Let's build K-Means using K=4

In [521]:
# Fit the K-Means algorithm using number of cluster as 4 and random_state=1 on data_pca
kmeans = KMeans(n_clusters= 4, random_state= 1).fit(data_pca)  
In [522]:
data_pca["K_means_segments_4"] = kmeans.labels_                    # Adding K-Means cluster labels to the data_pca data

data["K_means_segments_4"] = kmeans.labels_                        # Adding K-Means cluster labels to the whole data

data_model["K_means_segments_4"] = kmeans.labels_                  # Add K-Means cluster labels to data_model  
In [523]:
# Let's check the distribution
data_model["K_means_segments_4"].value_counts()
Out[523]:
0    1003
2     558
1     468
3     198
Name: K_means_segments_4, dtype: int64

Visualizing these clusters as a histogram:

In [524]:
plt.figure(figsize=(15,7))
sns.histplot(x='K_means_segments_4', data=data_model)
plt.show()

Let's visualize the clusters using PCA

In [525]:
# Function to visualize PCA data with clusters formed
def PCA_PLOT(X, Y, PCA, cluster):
    sns.scatterplot(x=X, y=Y, data=PCA, hue=cluster)
In [526]:
PCA_PLOT(0, 1, data_pca, "K_means_segments_4")

Observations and Insights:

  • About the distribution of observations: K=4 results in 4 clusters with a highly unbalanced size distribution, the largest being about five times larger than the smallest one.
  • About the scatterplot: We can observe that the only cluster that looks well defined is cluster 0. Cluster 2 overlaps with cluster 1, and clusters 1 and 3 are completely overlapped.

Most data points follow a similar pattern to what we saw with K=3, suggesting the two solutions have similar silhouette scores.
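The claim about similar silhouette scores can be checked directly. Below is a minimal, hypothetical sketch, using synthetic blobs in place of data_pca (which is built earlier in the notebook), of how the two solutions could be compared; the exact scores will differ from the notebook's:

```python
# Hypothetical sketch: comparing silhouette scores for K=3 vs. K=4.
# Synthetic blobs stand in for data_pca; scores here are illustrative only.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=3, random_state=1)

scores = {}
for k in (3, 4):
    labels = KMeans(n_clusters=k, random_state=1, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)       # higher = better-separated clusters
    print(f"K={k}: silhouette = {scores[k]:.3f}")
```

If the two scores are close, as the plots above suggest, the scatterplot overlap becomes the deciding factor between the solutions.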

Cluster Profiling¶

In [527]:
# Take the cluster-wise mean of all the variables. Hint: First groupby 'data' by cluster labels column and then find mean
cluster_profile_KMeans_4 = data.groupby('K_means_segments_4').mean()
In [528]:
# Highlight the maximum average value among all the clusters for each of the variables
cluster_profile_KMeans_4.style.highlight_max(color="lightgreen", axis=0)
Out[528]:
  Year_Birth Marital_Status Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age Kids Family_Size Expenses NumTotalPurchases Engaged_in_days TotalAcceptedCmp AmountPerPurchase
K_means_segments_4                                                                
0 1971.286142 1.655035 34797.108674 0.758724 0.457627 49.201396 38.528415 4.860419 21.354935 7.089731 5.076770 14.475573 1.900299 2.012961 0.529412 3.212363 6.343968 0.063809 0.012961 0.000000 0.000997 0.001994 0.010967 0.082752 44.713858 1.216351 2.871386 91.385842 7.655035 493.780658 0.079761 10.846512
1 1967.713675 1.630342 71614.416667 0.057692 0.311966 50.085470 479.042735 72.032051 398.544872 102.068376 71.410256 81.948718 1.628205 5.282051 5.570513 8.589744 2.995726 0.025641 0.014957 0.044872 0.061966 0.000000 0.008547 0.141026 48.286325 0.369658 2.000000 1205.047009 21.070513 547.267094 0.147436 59.625060
2 1965.643369 1.648746 55119.998208 0.345878 0.876344 47.695341 415.107527 17.336918 116.621864 24.019713 17.978495 55.842294 4.077061 6.327957 2.729391 7.272401 6.130824 0.080645 0.116487 0.010753 0.026882 0.008961 0.007168 0.136201 50.356631 1.222222 2.870968 646.906810 20.406810 605.026882 0.243728 31.229751
3 1968.666667 1.616162 80181.363636 0.045455 0.186869 50.398990 936.141414 52.873737 481.833333 78.898990 60.601010 72.196970 1.151515 5.641414 6.141414 8.363636 3.414141 0.212121 0.414141 0.681818 0.500000 0.111111 0.005051 0.550505 47.333333 0.232323 1.848485 1682.545455 21.297980 553.606061 1.919192 90.363729

Observations and Insights on each cluster:

  • Cluster 0 has 1003 observations. It corresponds to a segment of customers that have the highest mean in columns Year_Birth, Marital_Status, Kidhome, NumWebVisitsMonth, Complain, and Family_Size.
  • Cluster 1 has 468 observations. It corresponds to a segment of customers that has the highest mean in columns MntFruits, MntFishProducts, MntSweetProducts, MntGoldProds, NumStorePurchases.
  • Cluster 2 has 558 observations. It has the highest mean in columns Teenhome, NumDealsPurchases, NumWebPurchases, Age, Engaged_in_days.
  • Cluster 3 has 198 observations. It corresponds to a segment of customers that has the highest mean in columns Income, Recency, MntWines, MntMeatProducts, NumCatalogPurchases, all AcceptedCmp columns, Response, Expenses, TotalAcceptedCmp, NumTotalPurchases, and AmountPerPurchase.

From this information, I do not recognize any clear pattern that differentiates one cluster from another. Let's see whether any relevant additional information comes up in the boxplots.

Let's plot the boxplot

In [529]:
# Columns to use in boxplot
col_for_box = ['Income','Kidhome','Teenhome','Recency','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts','MntGoldProds','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases','NumWebVisitsMonth','Complain','Age','Family_Size','Expenses','NumTotalPurchases','Engaged_in_days','TotalAcceptedCmp','AmountPerPurchase']
In [530]:
# Creating boxplot for each of the variables
all_col = col_for_box

plt.figure(figsize = (30, 50))

for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    
    sns.boxplot(y=data[variable], x=data['K_means_segments_4'],showmeans=True)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()

Characteristics of each cluster¶

The features where these clusters differ the most are:

Cluster 0 profile:

  • Low-income customers, with a kid and a teenager at home living with them.
  • They are the segment that buys the least of all products.
  • They are the segment that purchases through website and catalog the least of all segments.
  • They mainly buy in store, followed by website purchases.
  • They are the ones that visit the website the most per month, though.
  • They include the youngest clients, with a minimum of 20 years old, and a median of 44 years old.
  • Overall, they spend the least amount of money of all segments, and also have the lowest number of total purchases.
  • They are the newest clients, and haven't accepted any campaign.

Summary for cluster 0: This cluster is made up of low-income customers; it includes the youngest clients (20 years old) and has a median age of 44. They live with children or teenagers at home, spend low amounts of money in the store, and have the lowest number of purchases and amount spent; they have not accepted any campaigns. They buy in store, and they are the ones that visit the website the most each month.

Cluster 1 profile:

  • Medium-high income customers
  • No kids at home, but with a teenager.
  • They are the segment that buys the most fish, fruit, sweet, and gold products.
  • Of all segments, they are the ones that buy in store the most. Their second channel is the catalog, followed by website purchases.
  • They make the largest number of total purchases of all segments.
  • 50% of their expenses across all products fall between 1,000 and almost 1,500.
  • They have not accepted any campaign.

Summary for cluster 1: This cluster is made up of medium-high income customers. They live with a teenager at home. They make the largest number of total purchases of all segments, buy wine above all, and purchase mostly in store.

Cluster 2 profile:

  • Medium-income customers.
  • They have a kid living at home with them.
  • They are the segment that buys deals the most and makes web purchases the most of all segments.
  • The product they buy the most is wine, followed by meat and fish. They mainly buy in store, then through the website, with catalog purchases in third place.
  • They are the oldest segment, with a median age of 50 years.
  • They are also the segment that has been engaged the longest, with a median of over 600 days since becoming clients.
  • They have not accepted any campaign.

Summary for cluster 2: This cluster is made up of medium-income customers. They live with a kid at home. They buy deals and make web purchases more than all other segments, buying wine in the first place. They have been engaged with the store longer than all other segments, with a median of 600 days.

Cluster 3 profile:

  • High-income customers.
  • They do not live with kids or teenagers at home.
  • They buy more wine than all other segments, with a median amount spent of almost 1,000, followed by meat products.
  • They don’t buy deals.
  • They buy mainly from the catalog and then in store.
  • They haven’t complained.
  • They have the widest age range, from 21 to approximately 75 years old, with a median of 47.
  • They have the largest total expenses, with a median of approximately 1,600.
  • They are the segment with the second-largest number of total purchases.
  • They are the only segment that has accepted any campaign.

Summary for cluster 3: They are the segment that has spent the most in the store, but they are the smallest segment, with only 198 observations. They are high-income customers who buy wine first, followed by meat, purchasing from the catalog and in store. They don't buy deals, and they are the only ones that have accepted any campaigns. They have the largest total expenses in the store, with a median of approximately 1,600.

In [531]:
# Dropping labels we got from K-Means since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True
data_pca.drop(['K_means_segments_4'], axis=1, inplace=True)
data.drop(['K_means_segments_4'], axis=1, inplace=True)

Let's build K-Means using K=5

In [532]:
# Fit the K-Means algorithm using number of cluster as 5 and random_state=0 on data_pca
kmeans = KMeans(n_clusters= 5, random_state= 0).fit(data_pca)  
In [533]:
data_pca["K_means_segments_5"] = kmeans.labels_                    # Adding K-Means cluster labels to the data_pca data

data["K_means_segments_5"] = kmeans.labels_                        # Adding K-Means cluster labels to the whole data

data_model["K_means_segments_5"] = kmeans.labels_                  # Add K-Means cluster labels to data_model  
In [534]:
# Let's check the distribution
data_model["K_means_segments_5"].value_counts()
Out[534]:
0    999
4    546
2    477
1    204
3      1
Name: K_means_segments_5, dtype: int64

Visualizing these clusters as a histogram:

In [535]:
plt.figure(figsize=(15,7))
sns.histplot(x='K_means_segments_5', data=data_model)
plt.show()

Let's visualize the clusters using PCA

In [536]:
# Function to visualize PCA data with clusters formed
def PCA_PLOT(X, Y, PCA, cluster):
    sns.scatterplot(x=X, y=Y, data=PCA, hue=cluster)
In [537]:
PCA_PLOT(0, 1, data_pca, "K_means_segments_5")

Observations and Insights:

  • About the distribution of observations: K=5 also results in a highly unbalanced distribution of cluster sizes, very similar to the K=4 result, although unbalanced in a different way: one cluster contains a single observation, and two clusters are almost completely overlapped.

Most data points follow a similar pattern to what we saw with K=3, suggesting similar silhouette scores, although the K=4 and K=5 solutions are not as well separated and defined as K=3.

Cluster Profiling¶

In [538]:
# Take the cluster-wise mean of all the variables. Hint: First groupby 'data' by cluster labels column and then find mean
cluster_profile_KMeans_5 = data.groupby('K_means_segments_5').mean()
In [539]:
# Highlight the maximum average value among all the clusters for each of the variables
cluster_profile_KMeans_5.style.highlight_max(color="lightgreen", axis=0)
Out[539]:
  Year_Birth Marital_Status Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age Kids Family_Size Expenses NumTotalPurchases Engaged_in_days TotalAcceptedCmp AmountPerPurchase
K_means_segments_5                                                                
0 1971.260260 1.654655 34751.676677 0.758759 0.458458 49.336336 38.158158 4.805806 21.265265 6.977978 5.061061 14.208208 1.897898 2.003003 0.523524 3.210210 6.343343 0.063063 0.013013 0.000000 0.001001 0.002002 0.011011 0.082082 44.739740 1.217217 2.871872 90.476476 7.634635 493.024024 0.079079 10.789356
1 1968.965686 1.622549 80288.941176 0.044118 0.181373 49.803922 931.220588 55.745098 481.661765 81.960784 61.318627 74.784314 1.151961 5.627451 6.176471 8.406863 3.397059 0.205882 0.401961 0.681373 0.500000 0.107843 0.004902 0.544118 47.034314 0.225490 1.848039 1686.691176 21.362745 556.764706 1.897059 82.595043
2 1967.599581 1.628931 71320.025157 0.060797 0.331237 50.283019 477.234801 70.324948 389.989518 99.467505 70.660377 80.190776 1.693920 5.354298 5.547170 8.616352 3.058700 0.025157 0.012579 0.037736 0.056604 0.000000 0.008386 0.138365 48.400419 0.392034 2.020964 1187.867925 21.211740 545.127883 0.132075 58.486627
3 1978.000000 2.000000 51315.000000 0.000000 0.000000 53.000000 32.000000 2.000000 1607.000000 12.000000 4.000000 22.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 38.000000 0.000000 2.000000 1679.000000 1.000000 754.000000 1.000000 1679.000000
4 1965.630037 1.648352 54730.915751 0.355311 0.875458 47.430403 410.366300 16.500000 112.236264 23.360806 16.976190 56.049451 4.087912 6.293040 2.652015 7.184982 6.177656 0.084249 0.119048 0.009158 0.025641 0.009158 0.007326 0.137363 50.369963 1.230769 2.879121 635.489011 20.217949 607.529304 0.247253 30.946379

Observations and Insights on each cluster:

  • Cluster 0 has 999 observations. It corresponds to a segment of customers that have the highest mean in columns Kidhome, NumWebVisitsMonth, and Complain.
  • Cluster 1 has 204 observations. It corresponds to a segment of customers that has the highest mean in columns Income, MntWines, NumCatalogPurchases, AcceptedCmp3, AcceptedCmp5, AcceptedCmp1, AcceptedCmp2, Response, Expenses, NumTotalPurchases, and TotalAcceptedCmp.
  • Cluster 2 has 477 observations. It corresponds to a segment of customers that has the highest mean in columns MntFruits, MntFishProducts, MntSweetProducts, MntGoldProds, and NumStorePurchases.
  • Cluster 3 has 1 observation. It has the highest mean in columns Year_Birth, Marital_Status, Recency, MntMeatProducts, AcceptedCmp4, Engaged_in_days, and AmountPerPurchase.
  • Cluster 4 has 546 observations. It corresponds to a segment of customers that has the highest mean in columns Teenhome, NumDealsPurchases, NumWebPurchases, Age, Kids, and Family_Size.

From this information, I do not recognize any clear pattern that differentiates one cluster from another either. In fact, the profiling qualities of each cluster don't make much sense; they don't seem to reveal anything useful about each cluster. Moreover, the distribution of the clusters is odd and confusing: we have a cluster with one observation, and two clusters that are almost completely overlapped. K=5 does not add anything valuable over the K=4 results, so I will keep K=4 as my chosen result from this model.
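The decision to stop at K=4 can also be supported numerically. As a rough, hypothetical sketch (again on synthetic stand-in data rather than the notebook's data_pca), one can scan a range of K values and compare inertia and silhouette scores:

```python
# Hypothetical sketch: scanning K to support a choice of cluster count.
# Synthetic data stands in for data_pca; the numbers are illustrative only.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=600, centers=4, cluster_std=1.2, random_state=0)

results = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, random_state=1, n_init=10).fit(X)
    results[k] = (km.inertia_, silhouette_score(X, km.labels_))
    print(f"K={k}: inertia={results[k][0]:.0f}, silhouette={results[k][1]:.3f}")

# Inertia always decreases with K, so the silhouette score (or the elbow in
# the inertia curve) is what actually guides the choice of K.
best_k = max(results, key=lambda k: results[k][1])
print("Best K by silhouette:", best_k)
```

On the real data_pca, such a scan would make the K=3 vs. K=4 vs. K=5 comparison explicit instead of relying only on visual inspection.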

Let's plot the boxplot

In [540]:
# Columns to use in boxplot
col_for_box = ['Income','Kidhome','Teenhome','Recency','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts','MntGoldProds','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases','NumWebVisitsMonth','Complain','Age','Family_Size','Expenses','NumTotalPurchases','Engaged_in_days','TotalAcceptedCmp','AmountPerPurchase']
In [541]:
# Creating boxplot for each of the variables
all_col = col_for_box

plt.figure(figsize = (30, 50))

for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    
    sns.boxplot(y=data[variable], x=data['K_means_segments_5'],showmeans=True)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()
In [542]:
# Dropping labels we got from K-Means since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True
data_pca.drop(['K_means_segments_5'], axis=1, inplace=True)
data.drop(['K_means_segments_5'], axis=1, inplace=True)

K-Medoids¶

Let's find the silhouette score for K=5 in K-Medoids

In [543]:
kmedo = KMedoids(n_clusters= 5, random_state= 1)    # Initializing K-Medoids with number of clusters as 5 and random_state=1

preds = kmedo.fit(data_pca).predict(data_pca)     # Fit and predict K-Medoids using data_pca

score = silhouette_score(data_pca, preds)  # Calculate the silhouette score
            
print(score)                   # Print the score
0.10383413504610778

Observations and Insights: The Silhouette score of K=5 in K-Medoids is very low compared to the scores we got in K-Means.

In [544]:
# Predicting on data_pca and adding K-Medoids cluster labels to the whole data  
data['K_medoids_segments'] = preds


# Predicting on data_pca and adding K-Medoids cluster labels to data_model
data_model['K_medoids_segments'] = preds


# Predicting on data_pca and adding K-Medoids cluster labels to data_pca
data_pca['K_medoids_segments'] = preds
In [545]:
# Let's check the distribution
data_model["K_medoids_segments"].value_counts()
Out[545]:
3    603
1    587
0    467
2    302
4    268
Name: K_medoids_segments, dtype: int64

Visualizing these clusters as a histogram:

In [546]:
plt.figure(figsize=(15,7))
sns.histplot(x='K_medoids_segments', data=data_model)
plt.show()

Let's visualize the clusters using PCA

In [547]:
# Hint: Use PCA_PLOT function created above
def PCA_PLOT(X, Y, PCA, cluster):
    sns.scatterplot(x=X, y=Y, data=PCA, hue=cluster)
In [548]:
PCA_PLOT(0, 1, data_pca, "K_medoids_segments")

Observations and Insights:

  • About the distribution of observations: As we could expect from a low silhouette score, these clusters overlap with one another; they do not show any density that could help identify clear patterns and clusters in the data. Although they are well balanced in terms of the number of observations in each cluster, the fact that they do not differentiate well from each other is not helpful for our business problem.

Cluster Profiling¶

In [549]:
# Take the cluster-wise mean of all the variables. Hint: First group 'data' by cluster labels column and then find mean
cluster_profile_KMedoids = data.groupby('K_medoids_segments').mean()
In [550]:
# Highlight the maximum average value among all the clusters for each of the variables
cluster_profile_KMedoids.style.highlight_max(color="lightgreen", axis=0)
Out[550]:
  Year_Birth Marital_Status Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age Kids Family_Size Expenses NumTotalPurchases Engaged_in_days TotalAcceptedCmp AmountPerPurchase
K_medoids_segments                                                                
0 1968.661670 1.605996 76628.961456 0.042827 0.216274 50.927195 667.261242 75.179872 483.462527 108.940043 79.886510 83.111349 1.381156 5.509636 5.944325 8.605996 3.081370 0.092077 0.141328 0.299786 0.241970 0.034261 0.004283 0.323340 47.338330 0.259101 1.865096 1497.841542 21.441113 568.668094 0.809422 76.349673
1 1965.683135 1.654174 61775.618399 0.187394 0.724020 46.025554 494.132879 28.928450 186.342419 39.562181 28.495741 66.158433 3.189097 6.091993 3.945486 8.195911 4.875639 0.071550 0.120954 0.037479 0.040886 0.018739 0.010221 0.126065 50.316865 0.911414 2.565588 843.620102 21.422487 573.446337 0.289608 40.554090
2 1971.546358 1.625828 33470.874172 0.751656 0.493377 42.516556 76.447020 7.927152 37.817881 12.235099 7.725166 25.745033 3.109272 3.172185 0.867550 3.576159 7.516556 0.072848 0.013245 0.000000 0.000000 0.000000 0.019868 0.208609 44.453642 1.245033 2.870861 167.897351 10.725166 785.850993 0.086093 14.460202
3 1972.320066 1.643449 33215.016584 0.817579 0.419569 46.266998 22.242123 3.001658 13.457711 4.499171 3.084577 9.807629 1.548922 1.615257 0.270315 2.873964 6.452736 0.077944 0.006633 0.000000 0.000000 0.003317 0.009950 0.061360 43.679934 1.237148 2.880597 56.092869 6.308458 417.504146 0.087894 8.589409
4 1965.589552 1.716418 48003.895522 0.522388 0.757463 66.578358 155.432836 9.085821 51.194030 12.679104 8.641791 26.078358 2.914179 3.947761 1.343284 4.899254 5.223881 0.033582 0.082090 0.000000 0.026119 0.000000 0.000000 0.033582 50.410448 1.279851 2.996269 263.111940 13.104478 400.518657 0.141791 19.010658

Observations and Insights on each cluster:

  • Cluster 0 has 467 observations. It corresponds to a segment of customers that has the highest mean in columns Income, MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, MntGoldProds, NumCatalogPurchases, NumStorePurchases, all AcceptedCmp columns, Response, Expenses, NumTotalPurchases, TotalAcceptedCmp, and AmountPerPurchase.
  • Cluster 1 has 587 observations. It corresponds to a segment of customers that has the highest mean in columns NumDealsPurchases and NumWebPurchases.
  • Cluster 2 has 302 observations. It corresponds to a segment of customers that has the highest mean in columns NumWebVisitsMonth, Complain, and Engaged_in_days.
  • Cluster 3 has 603 observations. It corresponds to a segment of customers that has the highest mean in columns Year_Birth and Kidhome.
  • Cluster 4 has 268 observations. It corresponds to a segment of customers that has the highest mean in columns Marital_Status, Teenhome, Recency, Age, Kids, and Family_Size.

From this information, I do not recognize any clear pattern that differentiates one cluster from another either. In fact, the profiling qualities of each cluster don't make much sense business-wise, so I would not regard them as useful clusters.

Let's plot the boxplot

In [551]:
# Columns to use in boxplot
col_for_box = ['Income','Kidhome','Teenhome','Recency','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts','MntGoldProds','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases','NumWebVisitsMonth','Complain','Age','Family_Size','Expenses','NumTotalPurchases','Engaged_in_days','TotalAcceptedCmp','AmountPerPurchase']
In [552]:
# Create boxplot for each of the variables

all_col = col_for_box

plt.figure(figsize = (30, 50))

for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    
    sns.boxplot(y=data[variable], x=data['K_medoids_segments'],showmeans=True)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()

Characteristics of each cluster¶

The features where these clusters differ the most are:

Cluster 0 profile:

  • High-income customers, with a median income near 80,000.
  • No kids or teenagers living at home with them.
  • Their median recency is 55 days.
  • They are the main buyers of wine in the store, with a median amount spent of 600.
  • They are also the main buyers of fruit, meat, fish, sweet and gold products of all customers.
  • They don’t buy deals; they are the customers that buy most from catalogs and in store.
  • They are the ones that visit the website the least.
  • 50% of them are between 36 and 58 years old, with a median of 46 years old.
  • They are the customers that have spent the largest amount of money on the store's products, and they have the highest median number of total purchases, closely followed by cluster 1.
  • 50% of them have been engaged with the store between 400 and 760 days, with a median of 600 days.
  • They are the only segment that has accepted any campaign.
  • They have spent the largest amount per purchase, of all segments.

Summary for cluster 0: This cluster is made up of high-income customers, with no kids or teenagers living with them. They are the main buyers of wine, they purchase from catalog and in store and are the only segment that have accepted any campaign. They barely visit the website and do not buy deals. They have spent the largest amount of money in products of the store and have the highest median of total number of purchases.

Cluster 1 profile:

  • Medium-high income customers
  • No kids at home living with them, but teenagers
  • They are the second-largest buyers of wine, and the second-largest buyers of all other products as well.
  • They buy a few deals, with a median of 2.
  • Of all segments, they buy from the website the most, although their main channel is in store, followed by the website and then the catalog.
  • They visit the website monthly with a median of 5 visits.
  • 50% of them are between 42 and 60 years old, with a median of 50 years old.
  • Their expenses on store products have a median of 800, being the second largest amount of all segments.
  • They have not accepted any campaign.

Summary for cluster 1: This cluster is made up of medium-high income customers, with a teenager living at home. They are the second-largest buyers of wine. Of all segments, they buy from the website the most, although their main channel is in store, followed by the website and then the catalog. They have not accepted any campaign. Their median age is 50 years.

Cluster 2 profile:

  • Low-income customers, with a median of $35,000
  • With a kid and a teenager living with them at home
  • They have the lowest recency, with a median of 38 days since last purchase.
  • They buy wine in the first place, gold products in the second place and fish products in the third place.
  • They mainly buy in store, and in second place they buy from the website.
  • They buy deals with a median of 3, and they visit the website the most of all segments.
  • 50% of them are between 46 and 51 years old, with a median of 44 years old.
  • They make a median of 10 total number of purchases
  • They are the earliest clients, having been engaged with the store for a minimum of 450 days and a median of approximately 780 days.
  • They have not accepted any campaign.

Summary for cluster 2: This cluster is made up of low-income customers, living with a kid and a teenager at home. They buy wine in the first place and mainly purchase in store. They visit the website the most of all segments, and they buy deals too. They are the oldest clients and have not accepted any campaign.

Cluster 3 profile:

  • Low-income customers
  • No kids at home, but living with a teenager
  • Of all segments, they are the ones that buy the least of any product
  • They purchase a few deals, and mainly in store
  • They are the second-most-frequent website visitors each month.
  • 50% of them are between 45 and 50 years old, with a median of 44 years old.
  • They have spent the least of all segments and make the lowest total number of purchases.
  • They are recent clients, having been engaged with the store for a median of 440 days.
  • They have not accepted any campaign.

Summary for cluster 3: This cluster is made up of low-income customers, living with a teenager at home. They buy the least of any product. They purchase few deals and mainly in store. They are recent clients and they have not accepted any campaign.

Cluster 4 profile:

  • Medium-income customers, with a kid and a teenager living at home.
  • They have the highest recency, with a median of 72 days since the last purchase.
  • They mainly buy wine, followed by meat, gold and fish products.
  • They buy deals with a median of 3
  • They purchase in the first place from the website, and in second place in store.
  • 50% of them are between 42 and 59 years old, with a median of 51 years old.
  • They are recent clients, with a median of 400 days of being engaged with the store.
  • They have not accepted any campaign.

Summary for cluster 4: This cluster is made up of medium-income customers, living with a kid and a teenager at home. They mainly buy wine, followed by meat, gold and fish products. They purchase in the first place from the website, and in second place in store. They have not accepted any campaign, and are recent clients.

In [553]:
# Dropping labels we got from K-Medoids since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True

data_pca.drop(['K_medoids_segments'], axis=1, inplace=True)
data.drop(['K_medoids_segments'], axis=1, inplace=True)

Hierarchical Clustering¶

Let's find the Cophenetic correlation for different distances with different linkage methods.

What is a Cophenetic correlation?¶

The cophenetic correlation coefficient is a correlation between the cophenetic distances (the distances implied by the dendrogram) obtained from the tree and the original distances used to construct the tree. It measures how faithfully a dendrogram preserves the pairwise distances between the original data points.

The cophenetic distance between two observations is represented in a dendrogram by the height of the link at which those two observations are first joined. That height is the distance between the two subclusters that are merged by that link.

Cophenetic correlation is a way to compare two or more dendrograms.
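As a minimal, self-contained illustration (on a tiny toy dataset, not the notebook's data_pca): two tight, well-separated groups should produce a dendrogram whose cophenetic distances track the original pairwise distances closely, i.e., a cophenetic correlation near 1.

```python
# Toy illustration of the cophenetic correlation (illustrative data only).
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Two tight groups of 10 points each, centered at 0 and 5
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(5, 0.5, (10, 2))])

Z = linkage(X, metric="euclidean", method="average")   # build the dendrogram
c, coph_dists = cophenet(Z, pdist(X))                  # compare tree vs. original distances
print(f"Cophenetic correlation: {c:.3f}")
```

Because the hierarchy fits such clean data well, the printed correlation is close to 1; noisier data yields lower values, which is exactly what we compare across distance metrics and linkage methods below.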

Let's calculate Cophenetic correlation for each of the distance metrics with each of the linkage methods

In [554]:
# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]

# list of linkage methods
linkage_methods = ["single", "complete", "average"]

high_cophenet_corr = 0                                                 # Creating a variable by assigning 0 to it
high_dm_lm = [0, 0]                                                    # Creating a list by assigning 0's to it

for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(data_pca, metric=dm, method=lm)                    # Applying different linkages with different distance on data_pca
        c, coph_dists = cophenet(Z, pdist(data_pca))                   # Calculating cophenetic correlation
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:                                     # Checking if this cophenetic correlation is higher than the best so far
            high_cophenet_corr = c                                     # Updating the highest score
            high_dm_lm[0] = dm                                         # Storing its corresponding distance metric
            high_dm_lm[1] = lm                                         # Storing its corresponding linkage method
Cophenetic correlation for Euclidean distance and single linkage is 0.8058309411978958.
Cophenetic correlation for Euclidean distance and complete linkage is 0.6646459995708378.
Cophenetic correlation for Euclidean distance and average linkage is 0.8649717378300196.
Cophenetic correlation for Chebyshev distance and single linkage is 0.7947453694045187.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.6608218045726901.
Cophenetic correlation for Chebyshev distance and average linkage is 0.7834968281406554.
Cophenetic correlation for Mahalanobis distance and single linkage is 0.6280292997438838.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.5789238824021619.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.6801758232330247.
Cophenetic correlation for Cityblock distance and single linkage is 0.8000297772063457.
Cophenetic correlation for Cityblock distance and complete linkage is 0.7205051270317547.
Cophenetic correlation for Cityblock distance and average linkage is 0.8602279087813257.
In [555]:
# Printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
    "Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
        high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
    )
)
Highest cophenetic correlation is 0.8649717378300196, which is obtained with Euclidean distance and average linkage.

Let's have a look at the dendrograms for different linkages with Cityblock distance

In [556]:
# List of linkage methods
linkage_methods = ["single", "complete", "average"]

# Lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]

# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))            # Setting the plot size

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="cityblock", method=method)                  # Measures the distances between two clusters

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")           # Title of dendrogram

    coph_corr, coph_dist = cophenet(Z, pdist(data_pca))                       # Finding cophenetic correlation for different linkages with city block distance
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )

Think about it:

  • Can we clearly decide the number of clusters based on where to cut the dendrogram horizontally?
  • What is the next step in obtaining number of clusters based on the dendrogram?
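
One common next step (sketched here on hypothetical toy blobs, not the project data) is to cut the tree into a flat clustering with scipy's `fcluster`, which is equivalent to drawing a horizontal line across the dendrogram:

```python
# A minimal sketch: cutting a dendrogram into k flat clusters with fcluster
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),    # toy blob around (0, 0)
               rng.normal(3, 0.3, (20, 2))])   # toy blob around (3, 3)
Z = linkage(X, metric="euclidean", method="average")
labels = fcluster(Z, t=2, criterion="maxclust")  # request 2 flat clusters
print(sorted(set(labels)))  # [1, 2]
```

With `criterion="maxclust"`, `t` is the desired number of clusters rather than a distance threshold, which sidesteps having to eyeball the cut height.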

Let's have a look at the dendrograms for different linkages with Chebyshev distance

In [557]:
# Plot the dendrogram for Chebyshev distance with linkages single, complete and average. 
# Hint: Use Chebyshev distance as the metric in the linkage() function 

# List of linkage methods
linkage_methods = ["single", "complete", "average"]

# Lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]

# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))            # Setting the plot size

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="chebyshev", method=method)                  # Measures the distances between two clusters

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")           # Title of dendrogram

    coph_corr, coph_dist = cophenet(Z, pdist(data_pca))                       # Finding cophenetic correlation for different linkages with Chebyshev distance
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )

Let's have a look at the dendrograms for different linkages with Mahalanobis distance

In [558]:
# Plot the dendrogram for Mahalanobis distance with linkages single, complete and average. 
# Hint: Use Mahalanobis distance as the metric in the linkage() function 

# List of linkage methods
linkage_methods = ["single", "complete", "average"]

# Lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]

# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))            # Setting the plot size

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="mahalanobis", method=method)                # Measures the distances between two clusters

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")           # Title of dendrogram

    coph_corr, coph_dist = cophenet(Z, pdist(data_pca))                       # Finding cophenetic correlation for different linkages with Mahalanobis distance
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )

Let's have a look at the dendrograms for different linkages with Euclidean distance

In [559]:
# Plot the dendrogram for Euclidean distance with linkages single, complete, average and ward. 
# Hint: Use Euclidean distance as the metric in the linkage() function 

# List of linkage methods (ward is valid only with Euclidean distance)
linkage_methods = ["single", "complete", "average", "ward"]

# Lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]

# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))            # Setting the plot size

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="euclidean", method=method)                  # Measures the distances between two clusters

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")           # Title of dendrogram

    coph_corr, coph_dist = cophenet(Z, pdist(data_pca))                       # Finding cophenetic correlation for different linkages with Euclidean distance
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )

Think about it:

  • Are there any distinct clusters in any of the dendrograms?

Observations and Insights about all dendrograms: I do not see that this model shows anything that would be really helpful beyond the other models I have tried so far. Many of the dendrograms are odd and hard to read; in most cases they suggest too many clusters, each with very few observations.

In [560]:
data_pca.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2227 entries, 0 to 2226
Data columns (total 17 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       2227 non-null   float64
 1   1       2227 non-null   float64
 2   2       2227 non-null   float64
 3   3       2227 non-null   float64
 4   4       2227 non-null   float64
 5   5       2227 non-null   float64
 6   6       2227 non-null   float64
 7   7       2227 non-null   float64
 8   8       2227 non-null   float64
 9   9       2227 non-null   float64
 10  10      2227 non-null   float64
 11  11      2227 non-null   float64
 12  12      2227 non-null   float64
 13  13      2227 non-null   float64
 14  14      2227 non-null   float64
 15  15      2227 non-null   float64
 16  16      2227 non-null   float64
dtypes: float64(17)
memory usage: 295.9 KB
In [561]:
# Initialize Agglomerative Clustering with affinity (distance) as Euclidean, linkage as 'Ward' with clusters=3
HCmodel = AgglomerativeClustering(n_clusters= 3, affinity = 'euclidean', linkage='ward').fit_predict(data_pca)

labels = HCmodel   

# Calculate silhouette score - Hint: Use silhouette_score() function
score = silhouette_score(data_pca, labels)  
print(score)
                                                                   
0.23452261634656826
In [562]:
# Add Agglomerative Clustering cluster labels to data_pca

data_pca["HC_segments"] = labels

# Add Agglomerative Clustering cluster labels to the whole data

data["HC_segments"] = labels

# Add Agglomerative Clustering cluster labels to data_model

data_model["HC_segments"] = labels
In [563]:
# Let's check the distribution
data_model['HC_segments'].value_counts()
Out[563]:
1    909
2    807
0    511
Name: HC_segments, dtype: int64

Let's visualize the clusters using PCA.

In [564]:
# Hint: Use PCA_PLOT function created above
def PCA_PLOT(X, Y, PCA, cluster):
    sns.scatterplot(x=X, y=Y, data=PCA, hue=cluster)
In [565]:
PCA_PLOT(0, 1, data_pca, "HC_segments")

Observations and Insights: About the distribution of observations: we can see that these clusters are not well separated; they lack density and overlap each other. Although they are well balanced in terms of the number of observations in each cluster, the fact that they do not differentiate well from each other is not very helpful.

Cluster Profiling¶

In [566]:
# Take the cluster-wise mean of all the variables. Hint: First group 'data' by cluster labels column and then find mean
cluster_profile_HC = data.groupby('HC_segments').mean()
In [567]:
# Highlight the maximum average value among all the clusters for each of the variables
cluster_profile_HC.style.highlight_max(color="lightgreen", axis=0)
Out[567]:
  Year_Birth Marital_Status Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age Kids Family_Size Expenses NumTotalPurchases Engaged_in_days TotalAcceptedCmp AmountPerPurchase
HC_segments                                                                
0 1968.450098 1.612524 76117.052838 0.033268 0.203523 51.561644 639.727984 70.739726 463.763209 101.354207 71.575342 77.469667 1.289628 5.111546 5.722114 8.082192 2.827789 0.086106 0.154599 0.299413 0.230920 0.039139 0.005871 0.299413 47.549902 0.236791 1.849315 1424.630137 20.205479 544.005871 0.810176 75.929317
1 1971.705171 1.646865 33512.654565 0.809681 0.449945 48.772277 30.623762 3.976898 18.013201 5.668867 4.207921 12.829483 1.869087 1.896590 0.415842 3.033003 6.500550 0.061606 0.005501 0.000000 0.000000 0.002200 0.012101 0.085809 44.294829 1.259626 2.906491 75.320132 7.214521 502.947195 0.069307 9.830857
2 1965.993804 1.662949 56618.462206 0.293680 0.765799 47.955390 402.830235 23.501859 142.576208 33.406444 24.960347 58.335812 3.485750 5.951673 3.190830 7.520446 5.586121 0.078067 0.102850 0.011152 0.032218 0.008674 0.007435 0.127633 50.006196 1.059480 2.722429 685.610905 20.148699 574.270136 0.232962 32.671213

Observations and Insights on each cluster:

  • Cluster 0 has 511 observations. It corresponds to a segment of customers that have the highest mean in columns Income, Recency, in all types of product columns, NumCatalogPurchases, NumStorePurchases, in all AcceptedCmp columns, Response, Expenses, NumTotalPurchases, TotalAcceptedCmp, and AmountPerPurchase.
  • Cluster 1 has 909 observations. It corresponds to a segment of customers that has the highest mean in columns Year_Birth, Kidhome, NumWebVisitsMonth, Complain, Kids, Family_Size.
  • Cluster 2 has 807 observations. It corresponds to a segment of customers that has the highest mean in columns Marital_Status, Teenhome, NumDealsPurchases, NumWebPurchases, Age, and Engaged_in_days.

Again, as with the previous model, I do not recognize in this information any clear pattern that differentiates one cluster from another. The profiling qualities of each cluster do not make much sense business-wise either.

Let's plot the boxplot

In [568]:
# Create boxplot for each of the variables

all_col = col_for_box

plt.figure(figsize = (30, 50))

for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    
    sns.boxplot(y=data[variable], x=data['HC_segments'],showmeans=True)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()

Characteristics of each cluster¶

Cluster 0:

  • High-income customers
  • Living with no kid or a teenager at home
  • They have the highest median on recency, with a median of 56 days since last purchase.
  • They are the segment that buys wines the most, with a median of almost $600, followed by meat, fish, gold, sweets and fruit products.
  • They do not buy deals
  • They buy mainly in store, followed by catalog and web purchases
  • They visit the website the least of all segments
  • 50% of them are between 38 and 58 years old, with a median of 47 years old
  • They make the largest number of purchases of all segments
  • They have been engaged with the store for a median of approximately 560 days.
  • They are the only segment that accepted a campaign.
  • They are the segment that has spent the largest amount per purchase of all segments.

Summary for cluster 0: This cluster is made up of high-income customers, living with no kid or teenager at home. They buy wines the most and do not buy deals. They buy mainly in store, followed by catalog and web purchases. Their median age is 47. They are the only segment that accepted a campaign, and the one that has spent the largest amount per purchase of all segments.

Cluster 1:

  • Low-income customers
  • No kid at home, but a teenager living with them
  • They buy very little wine, followed by gold, fish, fruit, meat and sweet products.
  • They buy a few deals and purchase mostly from the website, with catalog purchases in third place.
  • They do not buy in store.
  • They are the ones that visit the website the most
  • They are the youngest clients, 50% of them are between 37 and 51 years old, with a median of 44 years old.
  • They are the segment that spends the least, with the smallest total number of purchases and amount per purchase.
  • They are the most recent clients, engaged with the store for a median of 500 days.
  • They have not accepted any campaign.

Summary for cluster 1: This cluster is made up of low-income customers, living with a teenager at home. They are the segment that spends the least, with the smallest total number of purchases and amount per purchase. They are the most recent clients, engaged with the store for a median of 500 days. They have not accepted any campaign.

Cluster 2:

  • Medium-income customers, living with a kid and a teenager at home
  • They mainly purchase wine, followed by meat, gold, fish, sweets and fruit products.
  • They buy deals, with a median of 3
  • They mainly buy in store, followed by web purchases and catalog purchases in the third place.
  • They visit the website frequently, with a median of 6 times per month.
  • They are the oldest customers, 50% of them are between 42 and 59 years old, with a median of 50 years old.
  • They make the largest number of total purchases
  • They are the earliest customers, engaged with the store for a median of almost 600 days.
  • They have not accepted any campaign.

Summary for cluster 2: This cluster is made up of medium-income customers, living with a kid and a teenager at home. They mainly purchase wine, buy in store and visit the website frequently. They make the largest number of total purchases, and are the earliest customers, engaged with the store for a median of almost 600 days. They have not accepted any campaign.

Observations and Insights:

In [569]:
# Dropping labels we got from Agglomerative Clustering since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True

data_pca.drop(['HC_segments'], axis=1, inplace=True)
data.drop(['HC_segments'], axis=1, inplace=True)

DBSCAN¶

DBSCAN is a very powerful algorithm for finding high-density clusters, but the challenge is determining the best set of hyperparameters to use with it. It has two main hyperparameters: eps and min_samples.

Since it is an unsupervised algorithm, there is no ground truth to validate against, unlike a supervised learning algorithm that can be tested on a validation set. The approach we can follow is to try a number of different combinations of values and compute the silhouette score for each of them.
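
A common heuristic for narrowing the eps search, not used in this notebook but worth noting, is the k-distance plot: sort every point's distance to its k-th nearest neighbor and look for the elbow, which suggests a good eps. A sketch on hypothetical random data standing in for data_pca:

```python
# A minimal sketch of the k-distance heuristic for choosing DBSCAN's eps
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))                       # toy stand-in for data_pca
k = 5                                               # rule of thumb: k ~ min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)                     # per row: distances to k nearest
# Note: each query point is its own nearest neighbor (distance 0), so the last
# column is the distance to the (k-1)-th other point.
k_dist = np.sort(distances[:, -1])                  # k-th NN distance, ascending
# Plotting k_dist (e.g. plt.plot(k_dist)) reveals an elbow; eps is read off near it.
print(round(float(k_dist[-1]), 3))
```

This does not replace the grid search below, but it can shrink the range of eps values worth trying.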

In [570]:
# Initializing lists
eps_value = [2,3]                       # Taking random eps value
min_sample_values = [6,20]              # Taking random min_sample value

# Creating a dictionary for each of the values in eps_value with min_sample_values
res = {eps_value[i]: min_sample_values for i in range(len(eps_value))}  
In [571]:
# Finding the silhouette_score for each of the combinations
high_silhouette_avg = 0                                               # Assigning 0 to the high_silhouette_avg variable
high_i_j = [0, 0]                                                      # Assigning 0's to the high_i_j list
       
               
key = res.keys()                                                      # Assigning dictionary keys to a variable called key
for i in key:
    z = res[i]                                                        # Assigning dictionary values of each i to z
    for j in z:
        db = DBSCAN(eps=i, min_samples=j).fit(data_pca)               # Applying DBSCAN to each of the combination in dictionary
        core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
        core_samples_mask[db.core_sample_indices_] = True
        labels = db.labels_
        silhouette_avg = silhouette_score(data_pca, labels)           # Finding silhouette score 
        print( 
            "For eps value =" + str(i),
            "For min sample =" + str(j),
            "The average silhouette_score is :",
            silhouette_avg,                                          # Printing the silhouette score for each of the combinations
        )
        if high_silhouette_avg < silhouette_avg:                     # If the silhouette score is higher than the previous best, store it along with its combination of i and j
            high_silhouette_avg = silhouette_avg
            high_i_j[0] = i
            high_i_j[1] = j
For eps value =2 For min sample =6 The average silhouette_score is : 0.19623335001342507
For eps value =2 For min sample =20 The average silhouette_score is : 0.3381054125658042
For eps value =3 For min sample =6 The average silhouette_score is : 0.20036243530070832
For eps value =3 For min sample =20 The average silhouette_score is : 0.341052707104241
In [572]:
# Printing the highest silhouette score
print("Highest silhouette_avg is {} for eps = {} and min sample = {}".format(high_silhouette_avg, high_i_j[0], high_i_j[1]))
Highest silhouette_avg is 0.341052707104241 for eps = 3 and min sample = 20
In [573]:
# Initializing lists
eps_value = [1,2]                       # Taking random eps value
min_sample_values = [10,33]              # Taking random min_sample value

# Creating a dictionary for each of the values in eps_value with min_sample_values
res = {eps_value[i]: min_sample_values for i in range(len(eps_value))}  
In [574]:
# Finding the silhouette_score for each of the combinations
high_silhouette_avg = 0                                               # Assigning 0 to the high_silhouette_avg variable
high_i_j = [0, 0]                                                      # Assigning 0's to the high_i_j list
       
               
key = res.keys()                                                      # Assigning dictionary keys to a variable called key
for i in key:
    z = res[i]                                                        # Assigning dictionary values of each i to z
    for j in z:
        db = DBSCAN(eps=i, min_samples=j).fit(data_pca)               # Applying DBSCAN to each of the combination in dictionary
        core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
        core_samples_mask[db.core_sample_indices_] = True
        labels = db.labels_
        silhouette_avg = silhouette_score(data_pca, labels)           # Finding silhouette score 
        print( 
            "For eps value =" + str(i),
            "For min sample =" + str(j),
            "The average silhouette_score is :",
            silhouette_avg,                                          # Printing the silhouette score for each of the combinations
        )
        if high_silhouette_avg < silhouette_avg:                     # If the silhouette score is higher than the previous best, store it along with its combination of i and j
            high_silhouette_avg = silhouette_avg
            high_i_j[0] = i
            high_i_j[1] = j
For eps value =1 For min sample =10 The average silhouette_score is : 0.025643772778777187
For eps value =1 For min sample =33 The average silhouette_score is : 0.1031085655564581
For eps value =2 For min sample =10 The average silhouette_score is : 0.33983556606490745
For eps value =2 For min sample =33 The average silhouette_score is : 0.33536290923972917
In [575]:
# Initializing lists
eps_value = [1,2]                       # Taking random eps value
min_sample_values = [44,70]              # Taking random min_sample value

# Creating a dictionary for each of the values in eps_value with min_sample_values
res = {eps_value[i]: min_sample_values for i in range(len(eps_value))}  
In [576]:
# Finding the silhouette_score for each of the combinations
high_silhouette_avg = 0                                               # Assigning 0 to the high_silhouette_avg variable
high_i_j = [0, 0]                                                      # Assigning 0's to the high_i_j list
       
               
key = res.keys()                                                      # Assigning dictionary keys to a variable called key
for i in key:
    z = res[i]                                                        # Assigning dictionary values of each i to z
    for j in z:
        db = DBSCAN(eps=i, min_samples=j).fit(data_pca)               # Applying DBSCAN to each of the combination in dictionary
        core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
        core_samples_mask[db.core_sample_indices_] = True
        labels = db.labels_
        silhouette_avg = silhouette_score(data_pca, labels)           # Finding silhouette score 
        print( 
            "For eps value =" + str(i),
            "For min sample =" + str(j),
            "The average silhouette_score is :",
            silhouette_avg,                                          # Printing the silhouette score for each of the combinations
        )
        if high_silhouette_avg < silhouette_avg:                     # If the silhouette score is higher than the previous best, store it along with its combination of i and j
            high_silhouette_avg = silhouette_avg
            high_i_j[0] = i
            high_i_j[1] = j
For eps value =1 For min sample =44 The average silhouette_score is : 0.05684806462872944
For eps value =1 For min sample =70 The average silhouette_score is : -0.09151027357324977
For eps value =2 For min sample =44 The average silhouette_score is : 0.3341822157081539
For eps value =2 For min sample =70 The average silhouette_score is : 0.3272339012067853

Now, let's apply DBSCAN using the best hyperparameter values obtained above.

In [577]:
# Apply DBSCAN using the above hyperparameter values
dbs = DBSCAN(eps = 3, min_samples= 20)
In [578]:
# fit_predict on data_pca once and reuse the labels (DBSCAN is deterministic, so refitting is redundant)
labels = dbs.fit_predict(data_pca)

# Add DBSCAN cluster labels to the whole data
data['DBS_segments'] = labels

# Add DBSCAN cluster labels to data_model
data_model['DBS_segments'] = labels

# Add DBSCAN cluster labels to data_pca
data_pca['DBS_segments'] = labels
In [579]:
# Let's check the distribution
data_pca["DBS_segments"].value_counts()
Out[579]:
 0    1912
-1     315
Name: DBS_segments, dtype: int64

Let's visualize the clusters using PCA.

In [580]:
# Hint: Use PCA_PLOT function created above
def PCA_PLOT(X, Y, PCA, cluster):
    sns.scatterplot(x=X, y=Y, data=PCA, hue=cluster)
In [581]:
PCA_PLOT(0, 1, data_pca, "DBS_segments")

Observations and Insights: We can observe from this scatter plot that there are two groups, but they are not well separated, and cluster -1 (the label DBSCAN assigns to noise points) has very low density, so this solution does not yield good insights. Let's try the next pair of eps and min_samples values from the search above that got a reasonable silhouette score.

In [582]:
# Apply DBSCAN using the above hyperparameter values
dbs = DBSCAN(eps = 2, min_samples= 10)
In [583]:
# fit_predict on data_pca once and reuse the labels (DBSCAN is deterministic, so refitting is redundant)
labels = dbs.fit_predict(data_pca)

# Add DBSCAN cluster labels to the whole data
data['DBS_segments'] = labels

# Add DBSCAN cluster labels to data_model
data_model['DBS_segments'] = labels

# Add DBSCAN cluster labels to data_pca
data_pca['DBS_segments'] = labels
In [584]:
# Let's check the distribution
data_pca["DBS_segments"].value_counts()
Out[584]:
 0    1344
-1     883
Name: DBS_segments, dtype: int64

Let's visualize the clusters using PCA.

In [585]:
# Hint: Use PCA_PLOT function created above
def PCA_PLOT(X, Y, PCA, cluster):
    sns.scatterplot(x=X, y=Y, data=PCA, hue=cluster)
In [586]:
PCA_PLOT(0, 1, data_pca, "DBS_segments")

Observations and Insights: Here the two clusters are much better defined, although they still overlap a little. The observations are distributed in a more balanced way than before. Let's try the third pair of eps and min_samples values.

In [587]:
# Apply DBSCAN using the above hyperparameter values
dbs = DBSCAN(eps = 2, min_samples= 44)
In [588]:
# fit_predict on data_pca once and reuse the labels (DBSCAN is deterministic, so refitting is redundant)
labels = dbs.fit_predict(data_pca)

# Add DBSCAN cluster labels to the whole data
data['DBS_segments'] = labels

# Add DBSCAN cluster labels to data_model
data_model['DBS_segments'] = labels

# Add DBSCAN cluster labels to data_pca
data_pca['DBS_segments'] = labels
In [589]:
# Let's check the distribution
data_pca["DBS_segments"].value_counts()
Out[589]:
 0    1225
-1    1002
Name: DBS_segments, dtype: int64

Let's visualize the clusters using PCA.

In [590]:
# Hint: Use PCA_PLOT function created above
def PCA_PLOT(X, Y, PCA, cluster):
    sns.scatterplot(x=X, y=Y, data=PCA, hue=cluster)
In [591]:
PCA_PLOT(0, 1, data_pca, "DBS_segments")

Observations and Insights: Of these three solutions, I like this last one the best. It seems to have slightly less overlap than the previous solution, with otherwise very similar patterns. I will profile the clusters of this solution to find out more about them.

Cluster Profiling¶

In [592]:
# Take the cluster-wise mean of all the variables. Hint: First group 'data' by cluster labels column and then find mean
cluster_profile_DBSCAN = data.groupby('DBS_segments').mean()
In [593]:
# Highlight the maximum average value among all the clusters for each of the variables
cluster_profile_DBSCAN.style.highlight_max(color="lightgreen", axis=0)
Out[593]:
  Year_Birth Marital_Status Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age Kids Family_Size Expenses NumTotalPurchases Engaged_in_days TotalAcceptedCmp AmountPerPurchase
DBS_segments                                                                
-1 1967.270459 1.630739 68576.822355 0.126747 0.467066 49.382236 582.637725 51.186627 326.831337 73.197605 52.981038 75.341317 2.354291 5.852295 4.928144 8.298403 4.175649 0.092814 0.146707 0.161677 0.141717 0.026946 0.007984 0.225549 48.729541 0.593812 2.224551 1162.175649 21.433134 568.909182 0.569860 58.189258
0 1970.212245 1.656327 37825.257143 0.704490 0.541224 48.898776 78.382857 6.073469 33.413061 8.620408 6.086531 18.640000 2.295510 2.673469 0.766531 3.788571 6.267755 0.057143 0.016327 0.000000 0.001633 0.001633 0.009796 0.088163 45.787755 1.245714 2.902041 151.216327 9.524082 513.106122 0.076735 12.894826

Observations and Insights on each cluster:

  • Cluster -1 has 1002 observations. It corresponds to a segment of customers that have the highest mean in columns Income, Recency, MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, MntGoldProds, NumDealsPurchases, NumWebPurchases, NumCatalogPurchases, NumStorePurchases, all AcceptedCmp columns, Response, Age, Expenses, NumTotalPurchases, Engaged_in_days, TotalAcceptedCmp, AmountPerPurchase.

  • Cluster 0 has 1225 observations. It corresponds to a segment of customers that have the highest mean in columns Year_Birth, Marital_Status, Kidhome, Teenhome, NumWebVisitsMonth, Complain, Kids, Family_Size.

Let's plot the boxplot

In [594]:
# Create boxplot for each of the variables

all_col = col_for_box

plt.figure(figsize = (30, 50))

for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    
    sns.boxplot(y=data[variable], x=data['DBS_segments'],showmeans=True)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()

Characteristics of each cluster¶

The features where these clusters differ the most are:

Cluster -1:

  • High-income customers
  • Living with no kid at home, but a teenager
  • They buy much more of all products than the other cluster.
  • Buy deals, and purchase mainly in store, followed by web purchases and catalog
  • They are the oldest customers, with a median of 48 years old.
  • They spend the most, with a median of $1,100, and make the largest total number of purchases.
  • They are the earliest clients, 50% of them have been engaged with the store between 400 and 760 days.
  • They have accepted a campaign, and make the largest amount per purchase.

Summary for cluster -1: This cluster is made up of high-income customers, living with no kid but a teenager at home. They buy more of all products than the second cluster, and purchase mainly in store, followed by web purchases. They are the oldest segment, with a median of 48 years old. They have accepted a campaign and make the largest amount per purchase.

Cluster 0:

  • Low-income customers
  • Living with a kid and a teenager at home
  • They mainly buy wine, followed by meat, then gold products.
  • They buy deals
  • They buy as much from the website as in store
  • They visit the website the most every month
  • They are a median of 44 years old
  • They spend much less than the other segment, and make a lower total number of purchases, with a median of 8 (versus a median of 21 in the other segment)
  • They have not accepted any campaign.

Summary for cluster 0: This cluster is made up of low-income customers, living with a kid and a teenager at home. They buy less of all products than the other cluster, purchasing mainly wine, followed by meat. They buy in store, followed by web purchases. They visit the website often. They spend much less than the other segment, and make a lower total number of purchases, with a median of 8 (versus a median of 21 in the other segment). They are a median of 44 years old. They have not accepted any campaign.

Insights: Although this solution seemed worth exploring based on the scatter plot, it does not make the clusters any better defined than the models we applied before.
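It is worth noting that in scikit-learn's DBSCAN, the label -1 is not a regular cluster but the noise label: it marks points that did not fall inside any dense region. A minimal sketch on toy data (not the project's dataset) illustrating this:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two tight groups plus one isolated point
X = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
              [5.0, 5.0], [5.0, 5.1], [5.1, 5.0],
              [20.0, 20.0]])  # isolated outlier

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # the isolated point is labeled -1 (noise), the groups 0 and 1
```

This is one more reason to treat "Cluster -1" above with caution: it is the set of points DBSCAN could not assign to any dense cluster, rather than a cohesive segment.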

In [595]:
# Dropping labels we got from DBSCAN since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True
data_pca.drop(['DBS_segments'], axis=1, inplace=True)
data.drop(['DBS_segments'], axis=1, inplace=True)

Gaussian Mixture Model¶

Let's find the silhouette score for K=5 in Gaussian Mixture

In [596]:
gmm = GaussianMixture(n_components = 5, random_state = 1)  # Initialize Gaussian Mixture Model with number of clusters as 5 and random_state=1
   
preds = gmm.fit_predict(data_pca)   # Fit and predict Gaussian Mixture Model using data_pca  

score = silhouette_score(data_pca, preds)  # Calculate the silhouette score
            
print(score)               # Print the score
0.09112454908758941

Observations and Insights: The silhouette score is very low. In fact, it is the lowest we have obtained so far in this project.
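As a sanity check, the same silhouette comparison can be run over a range of component counts before settling on K=5. A minimal, hypothetical sketch, using synthetic blobs in place of data_pca:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

# Synthetic stand-in for data_pca: 300 points around 4 centers
X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

scores = {}
for k in range(2, 7):
    preds = GaussianMixture(n_components=k, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, preds)

for k, s in scores.items():
    print(k, round(s, 3))
```

On the real data_pca, the same loop would show whether any other number of components scores noticeably better than the 0.091 obtained for K=5.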

In [597]:
# Predicting on data_pca and add Gaussian Mixture Model cluster labels to the whole data

data['GMM_segments'] = preds

# Predicting on data_pca and add Gaussian Mixture Model cluster labels to data_model

data_model['GMM_segments'] = preds

# Predicting on data_pca and add Gaussian Mixture Model cluster labels to data_pca

data_pca['GMM_segments'] = preds
In [598]:
# Let's check the distribution
data_model['GMM_segments'].value_counts()
Out[598]:
1    692
4    579
2    468
0    454
3     34
Name: GMM_segments, dtype: int64

Let's visualize the clusters using PCA.

In [599]:
# Hint: Use PCA_PLOT function created above
def PCA_PLOT(X, Y, PCA, cluster):
    # Scatter plot of two principal components, colored by cluster label
    sns.scatterplot(x=X, y=Y, data=PCA, hue=cluster)
In [600]:
PCA_PLOT(0, 1, data_pca, "GMM_segments")

Observations and Insights: This scatter plot shows very messy, overlapping clusters. It is not possible to observe any pattern in the data that could be helpful for this business problem.

Cluster Profiling¶

In [601]:
# Take the cluster-wise mean of all the variables. Hint: First group 'data' by cluster labels column and then find mean
cluster_profile_GMM = data.groupby('GMM_segments').mean()
In [602]:
# Highlight the maximum average value among all the clusters for each of the variables
cluster_profile_GMM.style.highlight_max(color="lightgreen", axis=0)
Out[602]:
  Year_Birth Marital_Status Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age Kids Family_Size Expenses NumTotalPurchases Engaged_in_days TotalAcceptedCmp AmountPerPurchase
GMM_segments                                                                
0 1966.048458 1.669604 59928.936123 0.204846 0.777533 48.715859 416.317181 38.557269 161.676211 51.149780 38.832599 65.808370 3.314978 6.676211 3.814978 9.059471 5.079295 0.066079 0.081498 0.008811 0.011013 0.004405 0.011013 0.085903 49.951542 0.982379 2.651982 772.341410 22.865639 562.374449 0.171806 33.257678
1 1972.228324 1.650289 31642.988439 0.868497 0.436416 48.234104 17.771676 2.763006 11.744220 4.401734 2.927746 9.026012 1.736994 1.563584 0.258671 2.864162 6.390173 0.075145 0.002890 0.000000 0.000000 0.002890 0.014451 0.069364 43.771676 1.304913 2.955202 48.634393 6.423410 477.748555 0.080925 7.425567
2 1968.023504 1.660256 44576.096154 0.549145 0.655983 49.549145 139.200855 8.504274 52.175214 11.405983 8.209402 32.623932 3.049145 3.952991 1.305556 4.510684 6.297009 0.055556 0.049145 0.000000 0.014957 0.000000 0.006410 0.138889 47.976496 1.205128 2.865385 252.119658 12.818376 560.602564 0.119658 18.997444
3 1972.176471 1.647059 66295.176471 0.088235 0.235294 49.500000 624.676471 52.911765 341.764706 55.588235 56.529412 78.470588 0.882353 3.705882 4.058824 4.764706 5.558824 0.235294 0.382353 0.588235 0.470588 0.205882 0.000000 0.500000 43.823529 0.323529 1.970588 1209.941176 13.411765 600.176471 1.882353 138.326402
4 1967.630397 1.606218 73971.644214 0.062176 0.278066 50.112263 677.255613 57.915371 433.246978 87.060449 60.661485 76.454231 1.739206 5.267703 5.556131 7.924007 3.450777 0.081174 0.158895 0.238342 0.200345 0.031088 0.003454 0.284974 48.369603 0.340242 1.946459 1392.594128 20.487047 569.799655 0.709845 69.551674

Observations and Insights on each cluster:

  • Cluster 0 has 454 observations. It corresponds to a segment of customers that have the highest mean in columns Marital_Status, Teenhome, NumDealsPurchases, NumWebPurchases, NumStorePurchases, Age, NumTotalPurchases.
  • Cluster 1 has 692 observations. It corresponds to a segment of customers that have the highest mean in columns Year_Birth, Kidhome, NumWebVisitsMonth, Complain, Kids, Family_Size.
  • Cluster 2 has 468 observations. It corresponds to a segment of customers that do not have the highest mean in any column.
  • Cluster 3 has 34 observations. It corresponds to a segment of customers that have the highest mean in columns MntGoldProds, all AcceptedCamp columns, Response, Engaged_in_days, TotalAcceptedCmp, AmountPerPurchase.
  • Cluster 4 has 579 observations. It corresponds to a segment of customers that have the highest mean in columns Income, Recency, MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, NumCatalogPurchases, Expenses.
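Listing the "highest mean" columns by hand is error-prone; pandas can read them off the cluster profile directly with idxmax. A small sketch on a hypothetical miniature profile (three clusters, three features pulled from the table above for illustration):

```python
import pandas as pd

# Hypothetical miniature cluster profile (rows = cluster labels, columns = features)
profile = pd.DataFrame(
    {"Income": [59928.9, 31643.0, 73971.6],
     "NumWebVisitsMonth": [5.1, 6.4, 3.5],
     "Expenses": [772.3, 48.6, 1392.6]},
    index=[0, 1, 4])

# For each feature, the cluster label with the highest mean
top = profile.idxmax(axis=0)
print(top.to_dict())  # {'Income': 4, 'NumWebVisitsMonth': 1, 'Expenses': 4}
```

Applied to cluster_profile_GMM, `cluster_profile_GMM.idxmax(axis=0)` would produce the full mapping from each variable to its top cluster.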

Let's plot the boxplot

In [603]:
# Create boxplot for each of the variables

all_col = col_for_box

plt.figure(figsize = (30, 50))

for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    
    sns.boxplot(y=data[variable], x=data['GMM_segments'],showmeans=True)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()

Characteristics of each cluster¶

The features where these clusters differ the most are:

Cluster 0:

  • Medium-income customers
  • With a teenager at home living with them
  • They mainly buy wine (a median of about 400 spent), followed by meat, gold, fish and fruit products.
  • They buy deals
  • They buy first in store, then on the web, then from the catalog. In fact, they make the greatest number of in-store purchases of all segments
  • They visit the website frequently, with a median of 5 visits per month
  • They are the oldest clients, with a median of 50 years old
  • They make the largest total number of purchases of all segments; 50% of them make between 20 and 26 purchases.
  • They have been engaged with the store with a median of almost 600 days
  • They have not accepted any campaign.

Cluster 1:

  • Low-income customers
  • With a teenager at home living with them
  • They buy the smallest quantity of products of all segments
  • When they buy, they do it mostly in store and from the website
  • They visit the website often, with a median of 7 visits per month
  • They make the least total number of purchases and amount of expenses.
  • They are the most recent customers: 50% of them have been engaged with the store between 300 and 650 days.
  • They have not accepted any campaign.

Cluster 2:

  • Medium-low income customers
  • With a kid and a teenager living with them at home
  • Mainly buy wine, followed by meat, gold and fish.
  • They buy deals
  • They buy as much in store as on the website
  • They visit the website often, with a median of 7 visits per month
  • They are a median of 47 years old
  • 50% of them make a total number of purchases between 10 and 16.
  • They have been engaged with the store with a median of 560 days
  • They have not accepted any campaign.

Cluster 3:

  • High-income customers
  • With no kid or teenager living at home
  • They buy the largest amount of wine of all segments, followed by gold products.
  • They barely buy deals
  • They buy mainly from catalog, then in store and website
  • They have a mean age of about 44 years (per the profile table)
  • They make the largest amount of expenses of all segments
  • They have been engaged with the store with a median of 600 days
  • They have accepted the most campaigns, with a median of 2 accepted.

Cluster 4:

  • Medium-High income customers
  • With no kid, but a teenager living at home
  • They buy mainly wine products, followed by meat, fish and gold, then fruits and sweet products.
  • They buy first in store, then website purchases, followed by catalog purchases.
  • They are the segment that visits the website the least.
  • They have been engaged with the store with a median of 600 days
  • They have accepted 1 campaign

Additional insights from these clusters: These cluster profiles differ in many ways from those found by the previous models. I think these clusters do not help in getting a better picture of the differences between them.

Conclusion and Recommendations¶

  • Refined Insights: What are the most meaningful insights from the data relevant to the problem?
  1. Executive summary: The marketing team approached our data science team asking for help, because their efforts have been unsuccessful. The goal was to find the best possible customer segmentation, using the customer base they provided. The database originally had 2,240 entries and 27 columns. After exploring and engineering the data, and applying the models, some interesting key takeaways were found, which are conveyed in the proposed solution as follows.

  2. Proposed solution: After applying different models, four customer segments were found. This solution was chosen because it is the only one that separates the customers based on their purchase behavior (type of product they buy and amount spent in total) and divides them into four clear clusters, both in the number of observations each one has and in the quality of their profiling.

These four clusters are:

Cluster 0:

  • 1003 low-income customers
  • Buy the least of all segments in every product category
  • Mainly buy in store
  • Spent the least amount of money
  • Have not accepted any campaign

Cluster 1:

  • 468 medium-high income customers
  • Buy the greatest quantities of fish, fruits, sweets and gold products of all segments
  • Mainly buy in store, followed by catalog
  • Largest number of total purchases, and second in amount spent
  • Have not accepted any campaign

Cluster 2:

  • 558 medium-income customers
  • The product they buy the most is wine
  • Buy deals more than all other segments
  • Mainly buy in store and on the web
  • Have not accepted any campaign

Cluster 3:

  • 198 high-income customers
  • Buy more wine than all the others
  • Mainly buy from catalog, followed by in store
  • Make the largest total expenses of all
  • Have accepted a campaign

Reasons for choosing this segmentation solution:

  • Products assigned to different clusters: this solution divides the products into at least two different groups and assigns them to different clusters, making it easy to target segments based on their purchase behavior.
  • The number of observations is distributed in a fairly balanced proportion.
  • The profiling is relatively clear on the differences between segments, so it is useful.
  • Having four segments is a good number to work with business-wise: it is large enough to apply different action plans and diversify the investments, but also easy to manage for a team that is just starting to apply segmentation strategies.

How did I get to this solution?

  • Exploratory analysis of the database (2240 customers, 27 columns)
  • Feature engineering: merging some columns and dropping others.
  • Preparing the data: scaling and standardization
  • Application of the following algorithms, and analysis of their solutions: K-Means, K-Medoids, Hierarchical clustering, DBSCAN and Gaussian Mixture Model.

Model chosen: I chose K-Means with K=4, because it was the only algorithm that provided a reasonable number of clusters with a reasonable number of observations in each one, and that gave a solution where two different groups of products were allocated to different clusters. I think this is crucial because it is an essential part of the customers' purchase behavior. All the other solutions allocated all the products to one cluster, losing this edge. The exception was K-Means with K=5, but I did not choose it because the distribution of its clusters was less useful than K=4, in profiling and practical terms.
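The chosen configuration can be reproduced in a few lines. A hedged sketch, with synthetic blobs standing in for the scaled customer features (the data and sample count are illustrative; only the K=4 setup matches the project):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled / PCA-reduced customer data
X, _ = make_blobs(n_samples=2240, centers=4, random_state=1)

km = KMeans(n_clusters=4, n_init=10, random_state=1)
labels = km.fit_predict(X)

print(np.bincount(labels))          # observations per cluster
print(silhouette_score(X, labels))  # cohesion / separation check
```

On the real data, the same fit yields the four segments described above, whose sizes and profiles can then be checked with the cluster-profiling steps used throughout this notebook.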

Recommendation for implementation:

  • Test this solution for a period of time.
  • Try different marketing campaigns and other action plans, applying these segment definitions.
  • Focus on tracking how the customers' purchase behavior changes regarding products, channels and amount spent.
  • Track recency, to feed into the next iteration of this solution.
  • Define a date to apply these models again to the new data.